设计数据密集型应用程序

Designing Data-Intensive Applications

可靠、可扩展和可维护系统背后的伟大理念

The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

马丁·克莱普曼

Martin Kleppmann

设计数据密集型应用程序

Designing Data-Intensive Applications

作者:马丁·克莱普曼

by Martin Kleppmann

美国印刷。

Printed in the United States of America.

由 O'Reilly Media, Inc. 出版,地址:1005 Gravenstein Highway North, Sebastopol, CA 95472。

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

购买 O'Reilly 书籍可用于教育、商业或促销用途。大多数书目也提供在线版本(http://oreilly.com/safari)。欲了解更多信息,请联系我们的企业/机构销售部门:800-998-9938 或 corporate@oreilly.com。

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

编辑: 安·斯宾塞和玛丽·博古罗

Editors: Ann Spencer and Marie Beaugureau

索引编制:Ellen Troutman-Zaig

Indexer: Ellen Troutman-Zaig

制作编辑:克里斯汀·布朗

Production Editor: Kristen Brown

内文设计:大卫·富塔托

Interior Designer: David Futato

文案编辑:雷切尔·海德

Copyeditor: Rachel Head

封面设计:凯伦·蒙哥马利

Cover Designer: Karen Montgomery

校对:阿曼达·克西

Proofreader: Amanda Kersey

插画师:丽贝卡·德马雷斯特

Illustrator: Rebecca Demarest

  • 2017 年 3 月: 第一版
  • March 2017: First Edition

第一版的修订历史

Revision History for the First Edition

  • 2017-03-01: 首次发布
  • 2017-03-01: First Release

有关发布详细信息,请参阅 http://oreilly.com/catalog/errata.csp?isbn=9781449373320

See http://oreilly.com/catalog/errata.csp?isbn=9781449373320 for release details.

献词

Dedication

技术是我们社会的强大力量。数据、软件和通信可能被用来做坏事:巩固不公平的权力结构、破坏人权和保护既得利益。但它们也可以用于好的方面:让弱势群体的声音被听到,为每个人创造机会,并避免灾难。谨以此书献给每一个为行善而努力的人。

Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair power structures, to undermine human rights, and to protect vested interests. But they can also be used for good: to make underrepresented people’s voices heard, to create opportunities for everyone, and to avert disasters. This book is dedicated to everyone working toward the good.

计算是流行文化。[……]流行文化蔑视历史。流行文化就是关于身份和参与感。它与合作、过去或未来无关——它活在当下。我想大多数为了钱而写代码的人也是如此。他们不知道[他们的文化来自哪里]。

艾伦·凯(Alan Kay),《Dr. Dobb's Journal》访谈(2012)

Computing is pop culture. […] Pop culture holds a disdain for history. Pop culture is all about identity and feeling like you’re participating. It has nothing to do with cooperation, the past or the future—it’s living in the present. I think the same is true of most people who write code for money. They have no idea where [their culture came from].

Alan Kay, in interview with Dr Dobb’s Journal (2012)

前言

Preface

如果您近年来从事软件工程工作,特别是服务器端和后端系统方面,您可能已经被大量与数据存储和处理相关的流行语轰炸:NoSQL!大数据!Web 规模!分片!最终一致性!ACID!CAP 定理!云服务!MapReduce!实时!

If you have worked in software engineering in recent years, especially in server-side and backend systems, you have probably been bombarded with a plethora of buzzwords relating to storage and processing of data. NoSQL! Big Data! Web-scale! Sharding! Eventual consistency! ACID! CAP theorem! Cloud services! MapReduce! Real-time!

在过去的十年中,我们看到了数据库、分布式系统以及在它们之上构建应用程序的方式方面许多有趣的发展。这些发展有多种驱动力:

In the last decade we have seen many interesting developments in databases, in distributed systems, and in the ways we build applications on top of them. There are various driving forces for these developments:

  • 谷歌、雅虎、亚马逊、Facebook、LinkedIn、微软和 Twitter 等互联网公司正在处理大量的数据和流量,迫使他们创建新的工具,使他们能够有效地处理如此大规模的数据。

  • Internet companies such as Google, Yahoo!, Amazon, Facebook, LinkedIn, Microsoft, and Twitter are handling huge volumes of data and traffic, forcing them to create new tools that enable them to efficiently handle such scale.

  • 企业需要保持敏捷,以低廉的成本测试假设,并通过保持较短的开发周期和灵活的数据模型来快速响应新的市场洞察。

  • Businesses need to be agile, test hypotheses cheaply, and respond quickly to new market insights by keeping development cycles short and data models flexible.

  • 免费和开源软件已经变得非常成功,现在在许多环境中比商业或定制的内部软件更受青睐。

  • Free and open source software has become very successful and is now preferred to commercial or bespoke in-house software in many environments.

  • CPU 时钟速度几乎没有增加,但多核处理器已成为标准,并且网络变得越来越快。这意味着并行性只会增加。

  • CPU clock speeds are barely increasing, but multi-core processors are standard, and networks are getting faster. This means parallelism is only going to increase.

  • 即使您在一个小团队中工作,现在也可以构建分布在多台计算机甚至多个地理区域的系统,这要归功于 Amazon Web Services 等基础设施即服务 (IaaS)。

  • Even if you work on a small team, you can now build systems that are distributed across many machines and even multiple geographic regions, thanks to infrastructure as a service (IaaS) such as Amazon Web Services.

  • 现在预计许多服务将具有高可用性;由于停电或维护而导致的长时间停机变得越来越不可接受。

  • Many services are now expected to be highly available; extended downtime due to outages or maintenance is becoming increasingly unacceptable.

数据密集型应用程序正在利用这些技术发展来突破可能性的边界。如果数据是应用程序的主要挑战——数据量、数据复杂性或数据变化的速度——我们就称其为数据密集型应用程序;与之相对的是计算密集型应用程序,其瓶颈在于 CPU 周期。

Data-intensive applications are pushing the boundaries of what is possible by making use of these technological developments. We call an application data-intensive if data is its primary challenge—the quantity of data, the complexity of data, or the speed at which it is changing—as opposed to compute-intensive, where CPU cycles are the bottleneck.

帮助数据密集型应用程序存储和处理数据的工具和技术正在快速适应这些变化。新型数据库系统(“NoSQL”)已经受到广泛关注,但消息队列、缓存、搜索索引、批处理和流处理框架以及相关技术也非常重要。许多应用程序使用这些技术的某种组合。

The tools and technologies that help data-intensive applications store and process data have been rapidly adapting to these changes. New types of database systems (“NoSQL”) have been getting lots of attention, but message queues, caches, search indexes, frameworks for batch and stream processing, and related technologies are very important too. Many applications use some combination of these.

充满这个空间的流行语表明了对新可能性的热情,这是一件很棒的事情。然而,作为软件工程师和架构师,如果我们想要构建良好的应用程序,我们还需要对各种技术及其权衡有技术上的准确和精确的理解。为了理解这一点,我们必须比流行语更深入地挖掘。

The buzzwords that fill this space are a sign of enthusiasm for the new possibilities, which is a great thing. However, as software engineers and architects, we also need to have a technically accurate and precise understanding of the various technologies and their trade-offs if we want to build good applications. For that understanding, we have to dig deeper than buzzwords.

幸运的是,在技术快速变化的背后,无论您使用哪个版本的特定工具,都有一些持久的原则仍然有效。如果您了解这些原则,您就能够了解每个工具的适用范围、如何充分利用它以及如何避免其陷阱。这就是本书的用武之地。

Fortunately, behind the rapid changes in technology, there are enduring principles that remain true, no matter which version of a particular tool you are using. If you understand those principles, you’re in a position to see where each tool fits in, how to make good use of it, and how to avoid its pitfalls. That’s where this book comes in.

本书的目标是帮助您驾驭多样化且快速变化的数据处理和存储技术领域。本书不是某一特定工具的教程,也不是一本充满枯燥理论的教科书。相反,我们将研究成功的数据系统的示例:构成许多流行应用程序基础的技术,并且必须满足日常生产中的可扩展性、性能和可靠性要求。

The goal of this book is to help you navigate the diverse and fast-changing landscape of technologies for processing and storing data. This book is not a tutorial for one particular tool, nor is it a textbook full of dry theory. Instead, we will look at examples of successful data systems: technologies that form the foundation of many popular applications and that have to meet scalability, performance, and reliability requirements in production every day.

我们将深入研究这些系统的内部结构,梳理它们的关键算法,讨论它们的原理以及它们必须做出的权衡。在这个旅程中,我们将尝试找到 思考数据系统的有用方法——不仅仅是它们如何工作,还有它们为什么这样工作,以及我们需要提出哪些问题。

We will dig into the internals of those systems, tease apart their key algorithms, discuss their principles and the trade-offs they have to make. On this journey, we will try to find useful ways of thinking about data systems—not just how they work, but also why they work that way, and what questions we need to ask.

读完本书后,您将能够很好地决定哪种技术适合哪种目的,并了解如何组合工具以形成良好的应用程序架构的基础。您不会准备好从头开始构建自己的数据库存储引擎,但幸运的是,这很少有必要。然而,您将对系统在幕后所做的事情形成良好的直觉,以便您可以推理它们的行为,做出良好的设计决策,并跟踪可能出现的任何问题。

After reading this book, you will be in a great position to decide which kind of technology is appropriate for which purpose, and understand how tools can be combined to form the foundation of a good application architecture. You won’t be ready to build your own database storage engine from scratch, but fortunately that is rarely necessary. You will, however, develop a good intuition for what your systems are doing under the hood so that you can reason about their behavior, make good design decisions, and track down any problems that may arise.

谁应该读这本书?

Who Should Read This Book?

如果您开发的应用程序具有某种用于存储或处理数据的服务器/后端,并且您的应用程序使用互联网(例如,Web 应用程序、移动应用程序或连接互联网的传感器),那么本书适合您。

If you develop applications that have some kind of server/backend for storing or processing data, and your applications use the internet (e.g., web applications, mobile apps, or internet-connected sensors), then this book is for you.

本书适合热爱编码的软件工程师、软件架构师和技术经理。如果您需要对您所使用的系统的架构做出决策,例如,如果您需要选择用于解决给定问题的工具并找出如何最好地应用它们,那么它尤其重要。但即使您无法选择自己的工具,本书也将帮助您更好地了解它们的优点和缺点。

This book is for software engineers, software architects, and technical managers who love to code. It is especially relevant if you need to make decisions about the architecture of the systems you work on—for example, if you need to choose tools for solving a given problem and figure out how best to apply them. But even if you have no choice over your tools, this book will help you better understand their strengths and weaknesses.

您应该有一些构建基于 Web 的应用程序或网络服务的经验,并且应该熟悉关系数据库和 SQL。您知道的任何非关系数据库和其他数据相关工具都是额外的好处,但不是必需的。对 TCP 和 HTTP 等常见网络协议的总体了解很有帮助。您选择的编程语言或框架对本书没有影响。

You should have some experience building web-based applications or network services, and you should be familiar with relational databases and SQL. Any non-relational databases and other data-related tools you know are a bonus, but not required. A general understanding of common network protocols like TCP and HTTP is helpful. Your choice of programming language or framework makes no difference for this book.

如果您符合以下任一条件,您就会发现这本书很有价值:

If any of the following are true for you, you’ll find this book valuable:

  • 您想要了解如何使数据系统可扩展,例如,以支持拥有数百万用户的 Web 或移动应用程序。

  • You want to learn how to make data systems scalable, for example, to support web or mobile apps with millions of users.

  • 您需要使应用程序具有高可用性(最大限度地减少停机时间)和操作稳健性。

  • You need to make applications highly available (minimizing downtime) and operationally robust.

  • 从长远来看,您正在寻找使系统更易于维护的方法,即使系统不断增长以及需求和技术发生变化。

  • You are looking for ways of making systems easier to maintain in the long run, even as they grow and as requirements and technologies change.

  • 您对事物的运作方式有着天生的好奇心,并且想知道主要网站和在线服务内部发生了什么。本书详细介绍了各种数据库和数据处理系统的内部结构,探索其设计中的聪明思维非常有趣。

  • You have a natural curiosity for the way things work and want to know what goes on inside major websites and online services. This book breaks down the internals of various databases and data processing systems, and it’s great fun to explore the bright thinking that went into their design.

有时,在讨论可扩展数据系统时,人们会发表这样的评论:“你不是谷歌或亚马逊。别再担心规模问题,用关系数据库就好了。” 这句话有一定道理:为你并不需要的规模做设计是浪费精力,并且可能会把你锁定在不灵活的设计中。实际上,这是一种过早优化。然而,为工作选择正确的工具同样重要,不同的技术各有优缺点。正如我们将看到的,关系数据库很重要,但并不是处理数据的最终定论。

Sometimes, when discussing scalable data systems, people make comments along the lines of, “You’re not Google or Amazon. Stop worrying about scale and just use a relational database.” There is truth in that statement: building for scale that you don’t need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization. However, it’s also important to choose the right tool for the job, and different technologies each have their own strengths and weaknesses. As we shall see, relational databases are important but not the final word on dealing with data.

本书的范围

Scope of This Book

本书并不试图提供有关如何安装或使用特定软件包或 API 的详细说明,因为已经有大量关于这些内容的文档。相反,我们讨论数据系统的基本原则和权衡,并探索不同产品采取的不同设计决策。

This book does not attempt to give detailed instructions on how to install or use specific software packages or APIs, since there is already plenty of documentation for those things. Instead we discuss the various principles and trade-offs that are fundamental to data systems, and we explore the different design decisions taken by different products.

在电子书版本中,我们包含了在线资源全文的链接。所有链接在发布时均经过验证,但不幸的是,由于网络的性质,链接往往会经常中断。如果您遇到损坏的链接,或者您正在阅读本书的印刷版,您可以使用搜索引擎查找参考资料。对于学术论文,您可以在 Google 学术搜索中搜索标题来查找开放获取的 PDF 文件。或者,您可以在https://github.com/ept/ddia-references找到所有参考资料,我们在其中维护最新的链接。

In the ebook editions we have included links to the full text of online resources. All links were verified at the time of publication, but unfortunately links tend to break frequently due to the nature of the web. If you come across a broken link, or if you are reading a print copy of this book, you can look up references using a search engine. For academic papers, you can search for the title in Google Scholar to find open-access PDF files. Alternatively, you can find all of the references at https://github.com/ept/ddia-references, where we maintain up-to-date links.

我们主要关注数据系统的架构,以及它们集成到数据密集型应用程序中的方式。本书没有篇幅涵盖部署、运维、安全、管理等领域——这些都是复杂而重要的主题,如果只在本书中以肤浅的旁注带过,是无法公正对待它们的。它们值得有自己的专著。

We look primarily at the architecture of data systems and the ways they are integrated into data-intensive applications. This book doesn’t have space to cover deployment, operations, security, management, and other areas—those are complex and important topics, and we wouldn’t do them justice by making them superficial side notes in this book. They deserve books of their own.

本书中描述的许多技术都属于大数据流行语的范围。然而,“大数据”一词被过度使用且定义不足,以至于它在严肃的工程讨论中没有用处。本书使用了不太含糊的术语,例如单节点与分布式系统,或者在线/交互式与离线/批处理系统。

Many of the technologies described in this book fall within the realm of the Big Data buzzword. However, the term “Big Data” is so overused and underdefined that it is not useful in a serious engineering discussion. This book uses less ambiguous terms, such as single-node versus distributed systems, or online/interactive versus offline/batch processing systems.

本书偏向于自由开源软件(FOSS),因为阅读、修改和执行源代码是了解某些东西详细工作原理的好方法。开放平台还可以降低供应商锁定的风险。然而,在适当的情况下,我们也会讨论专有软件(闭源软件、软件即服务或仅在文献中描述但未公开发布的公司内部软件)。

This book has a bias toward free and open source software (FOSS), because reading, modifying, and executing source code is a great way to understand how something works in detail. Open platforms also reduce the risk of vendor lock-in. However, where appropriate, we also discuss proprietary software (closed-source software, software as a service, or companies’ in-house software that is only described in literature but not released publicly).

本书概要

Outline of This Book

本书分为三个部分:

This book is arranged into three parts:

  1. 第一部分中,我们讨论支撑数据密集型应用程序设计的基本思想。我们从第一章开始讨论我们实际想要实现的目标:可靠性、可扩展性和可维护性;我们需要如何考虑它们;以及我们如何实现这些目标。在第 2 章中,我们比较了几种不同的数据模型和查询语言,并了解它们如何适合不同的情况。在 第3章中,我们讨论存储引擎:数据库如何在磁盘上排列数据,以便我们可以有效地再次找到它。第 4 章讨论数据编码(序列化)的格式以及模式随时间的演变。

  2. In Part I, we discuss the fundamental ideas that underpin the design of data-intensive applications. We start in Chapter 1 by discussing what we’re actually trying to achieve: reliability, scalability, and maintainability; how we need to think about them; and how we can achieve them. In Chapter 2 we compare several different data models and query languages, and see how they are appropriate to different situations. In Chapter 3 we talk about storage engines: how databases arrange data on disk so that we can find it again efficiently. Chapter 4 turns to formats for data encoding (serialization) and evolution of schemas over time.

  3. 第二部分中,我们从存储在一台机器上的数据转向分布在多台机器上的数据。这对于可扩展性来说通常是必要的,但也带来了各种独特的挑战。我们首先讨论复制(第 5 章)、分区/分片(第 6 章)和事务(第 7 章)。然后我们更详细地讨论分布式系统的问题(第 8 章)以及在分布式系统中实现一致性和共识的含义(第 9 章)。

  4. In Part II, we move from data stored on one machine to data that is distributed across multiple machines. This is often necessary for scalability, but brings with it a variety of unique challenges. We first discuss replication (Chapter 5), partitioning/sharding (Chapter 6), and transactions (Chapter 7). We then go into more detail on the problems with distributed systems (Chapter 8) and what it means to achieve consistency and consensus in a distributed system (Chapter 9).

  5. 第三部分中,我们讨论从其他数据集派生一些数据集的系统。派生数据经常出现在异构系统中:当没有一个数据库可以做好所有事情时,应用程序需要集成多个不同的数据库、缓存、索引等。在 第 10 章中,我们从派生数据的批处理方法开始,并在第 11 章中以流处理为基础。最后,在第 12 章中,我们将所有内容放在一起,并讨论未来构建可靠、可扩展和可维护的应用程序的方法。

  6. In Part III, we discuss systems that derive some datasets from other datasets. Derived data often occurs in heterogeneous systems: when there is no one database that can do everything well, applications need to integrate several different databases, caches, indexes, and so on. In Chapter 10 we start with a batch processing approach to derived data, and we build upon it with stream processing in Chapter 11. Finally, in Chapter 12 we put everything together and discuss approaches for building reliable, scalable, and maintainable applications in the future.

参考文献和进一步阅读

References and Further Reading

我们在本书中讨论的大部分内容已经在其他地方以某种形式出现过——在会议演示、研究论文、博客文章、代码、错误跟踪器、邮件列表和工程民间传说中。本书总结了来自许多不同来源的最重要的思想,并在全文中包含了原始文献的链接。如果您想更深入地探索某个领域,每章末尾的参考文献都是很好的资源,并且大多数参考文献都可以在线免费获取。

Most of what we discuss in this book has already been said elsewhere in some form or another—in conference presentations, research papers, blog posts, code, bug trackers, mailing lists, and engineering folklore. This book summarizes the most important ideas from many different sources, and it includes pointers to the original literature throughout the text. The references at the end of each chapter are a great resource if you want to explore an area in more depth, and most of them are freely available online.

O’Reilly Safari

O’Reilly Safari

笔记

Safari(以前称为 Safari Books Online)是一个面向企业、政府、教育工作者和个人的会员制培训和参考平台。

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

会员可以访问来自 250 多家出版商的数千本书籍、培训视频、学习路径、交互式教程和精选播放列表,出版商包括 O'Reilly Media、Harvard Business Review、Prentice Hall Professional、Addison-Wesley Professional、Microsoft Press、Sams、Que、Peachpit Press、Adobe、Focal Press、Cisco Press、John Wiley & Sons、Syngress、Morgan Kaufmann、IBM Redbooks、Packt、Adobe Press、FT Press、Apress、Manning、New Riders、McGraw-Hill、Jones & Bartlett、Course Technology 等等。

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

欲了解更多信息,请访问http://oreilly.com/safari

For more information, please visit http://oreilly.com/safari.

如何联系我们

How to Contact Us

请向出版商提出有关本书的意见和问题:

Please address comments and questions concerning this book to the publisher:

  • 奥莱利媒体公司
  • O’Reilly Media, Inc.
  • 格拉文斯坦公路北1005号
  • 1005 Gravenstein Highway North
  • 塞瓦斯托波尔, CA 95472
  • Sebastopol, CA 95472
  • 800-998-9938(美国或加拿大)
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515(国际或本地)
  • 707-829-0515 (international or local)
  • 707-829-0104(传真)
  • 707-829-0104 (fax)

我们为本书建立了一个网页,其中列出了勘误表、示例和其他信息。您可以通过 http://bit.ly/designing-data-intensive-apps 访问此页面。

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/designing-data-intensive-apps.

要评论或询问有关本书的技术问题,请发送电子邮件至

To comment or ask technical questions about this book, send email to .

有关我们的书籍、课程、会议和新闻的更多信息,请访问我们的网站:http://www.oreilly.com

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

在 Facebook 上找到我们:http://facebook.com/oreilly

Find us on Facebook: http://facebook.com/oreilly

在 Twitter 上关注我们:http://twitter.com/oreillymedia

Follow us on Twitter: http://twitter.com/oreillymedia

在 YouTube 上观看我们的视频:http://www.youtube.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

致谢

Acknowledgments

本书是对大量其他人的思想和知识的融合和系统化,结合了学术研究和工业实践的经验。在计算领域,我们往往会被新颖而闪亮的事物所吸引,但我认为我们可以从以前做过的事情中学到很多东西。本书引用了 800 多篇文章、博客文章、演讲、文档等,它们对我来说是宝贵的学习资源。我非常感谢本材料的作者分享他们的知识。

This book is an amalgamation and systematization of a large number of other people’s ideas and knowledge, combining experience from both academic research and industrial practice. In computing we tend to be attracted to things that are new and shiny, but I think we have a huge amount to learn from things that have been done before. This book has over 800 references to articles, blog posts, talks, documentation, and more, and they have been an invaluable learning resource for me. I am very grateful to the authors of this material for sharing their knowledge.

我还从个人谈话中学到了很多东西,这要感谢许多人花时间与我讨论想法或耐心地向我解释事情。我要特别感谢 Joe Adler、Ross Anderson、Peter Bailis、Márton Balassi、Alastair Beresford、Mark Callaghan、Mat Clayton、Patrick Collison、Sean Cribbs、Shirshanka Das、Niklas Ekström、Stephan Ewen、Alan Fekete、Gyula Fóra、Camille Fournier、Andres Freund、John Garbutt、Seth Gilbert、Tom Haggett、Pat Helland、Joe Hellerstein、Jakob Homan、Heidi Howard、John Hugg、Julian Hyde、Conrad Irwin、Evan Jones、Flavio Junqueira、Jessica Kerr、Kyle Kingsbury、Jay Kreps、Carl Lerche、Nicolas Liochon、Steve Loughran、Lee Mallabone、Nathan Marz、Caitie McCaffrey、Josie McLellan、Christopher Meiklejohn、Ian Meyers、Neha Narkhede、Neha Narula、Cathy O'Neil、Onora O'Neill、Ludovic Orban、Zoran Perkov、Julia Powles、Chris Riccomini、Henry Robinson、David Rosenthal、Jennifer Rullmann、Matthew Sackman、Martin Scholl、Amit Sela、Gwen Shapira、Greg Spurrier、Sam Stokes、Ben Stopford、Tom Stuart、Diana Vasile、Rahul Vohra、Pete Warden 和 Brett Wooldridge。

I have also learned a lot from personal conversations, thanks to a large number of people who have taken the time to discuss ideas or patiently explain things to me. In particular, I would like to thank Joe Adler, Ross Anderson, Peter Bailis, Márton Balassi, Alastair Beresford, Mark Callaghan, Mat Clayton, Patrick Collison, Sean Cribbs, Shirshanka Das, Niklas Ekström, Stephan Ewen, Alan Fekete, Gyula Fóra, Camille Fournier, Andres Freund, John Garbutt, Seth Gilbert, Tom Haggett, Pat Helland, Joe Hellerstein, Jakob Homan, Heidi Howard, John Hugg, Julian Hyde, Conrad Irwin, Evan Jones, Flavio Junqueira, Jessica Kerr, Kyle Kingsbury, Jay Kreps, Carl Lerche, Nicolas Liochon, Steve Loughran, Lee Mallabone, Nathan Marz, Caitie McCaffrey, Josie McLellan, Christopher Meiklejohn, Ian Meyers, Neha Narkhede, Neha Narula, Cathy O’Neil, Onora O’Neill, Ludovic Orban, Zoran Perkov, Julia Powles, Chris Riccomini, Henry Robinson, David Rosenthal, Jennifer Rullmann, Matthew Sackman, Martin Scholl, Amit Sela, Gwen Shapira, Greg Spurrier, Sam Stokes, Ben Stopford, Tom Stuart, Diana Vasile, Rahul Vohra, Pete Warden, and Brett Wooldridge.

还有一些人审阅了草稿并提供反馈,为本书的写作做出了无价的贡献。对于这些贡献,我特别感谢 Raul Agepati、Tyler Akidau、Mattias Andersson、Sasha Baranov、Veena Basavaraj、David Beyer、Jim Brikman、Paul Carey、Raul Castro Fernandez、Joseph Chow、Derek Elkins、Sam Elliott、Alexander Gallego、Mark Grover、Stu Halloway、Heidi Howard、Nicola Kleppmann、Stefan Kruppa、Bjorn Madsen、Sander Mak、Stefan Podkowinski、Phil Potter、Hamid Ramazani、Sam Stokes 和 Ben Summers。当然,书中任何遗留的错误或令人不快的观点,责任都在我。

Several more people have been invaluable to the writing of this book by reviewing drafts and providing feedback. For these contributions I am particularly indebted to Raul Agepati, Tyler Akidau, Mattias Andersson, Sasha Baranov, Veena Basavaraj, David Beyer, Jim Brikman, Paul Carey, Raul Castro Fernandez, Joseph Chow, Derek Elkins, Sam Elliott, Alexander Gallego, Mark Grover, Stu Halloway, Heidi Howard, Nicola Kleppmann, Stefan Kruppa, Bjorn Madsen, Sander Mak, Stefan Podkowinski, Phil Potter, Hamid Ramazani, Sam Stokes, and Ben Summers. Of course, I take all responsibility for any remaining errors or unpalatable opinions in this book.

我感谢我的编辑 Marie Beaugureau、Mike Loukides、Ann Spencer 以及 O'Reilly 的所有团队,感谢他们帮助这本书成为现实,感谢他们对我缓慢的写作和不寻常的要求的耐心。我感谢雷切尔·海德(Rachel Head)帮助我找到了正确的词语。我感谢阿拉斯泰尔·贝雷斯福德 (Alastair Beresford)、苏珊·古德休 (Susan Goodhue)、内哈·纳赫德 (Neha Narkhede) 和凯文·斯科特 (Kevin Scott),尽管他们还有其他工作要做,但他们给了我时间和自由来写作。

For helping this book become real, and for their patience with my slow writing and unusual requests, I am grateful to my editors Marie Beaugureau, Mike Loukides, Ann Spencer, and all the team at O’Reilly. For helping find the right words, I thank Rachel Head. For giving me the time and freedom to write in spite of other work commitments, I thank Alastair Beresford, Susan Goodhue, Neha Narkhede, and Kevin Scott.

特别感谢 Shabbir Diwan 和 Edie Freedman,他们非常仔细地为各章附带的地图绘制了插图。令人惊奇的是,他们采用了非常规的想法来创建地图,并使它们变得如此美丽和引人注目。

Very special thanks are due to Shabbir Diwan and Edie Freedman, who illustrated with great care the maps that accompany the chapters. It’s wonderful that they took on the unconventional idea of creating maps, and made them so beautiful and compelling.

最后,我向我的家人和朋友表达我的爱,没有他们,我不可能完成这个花了近四年的写作过程。你是最好的。

Finally, my love goes to my family and friends, without whom I would not have been able to get through this writing process that has taken almost four years. You’re the best.

第一部分:数据系统的基础

Part I. Foundations of Data Systems

前四章介绍了适用于所有数据系统的基本思想,无论是在单台机器上运行还是分布在机器集群上:

The first four chapters go through the fundamental ideas that apply to all data systems, whether running on a single machine or distributed across a cluster of machines:

  1. 第一章介绍了我们将在本书中使用的术语和方法。它研究了可靠性可扩展性可维护性等词语的实际含义,以及我们如何努力实现这些目标。

  2. Chapter 1 introduces the terminology and approach that we’re going to use throughout this book. It examines what we actually mean by words like reliability, scalability, and maintainability, and how we can try to achieve these goals.

  3. 第 2 章比较了几种不同的数据模型和查询语言——从开发人员的角度来看,这是数据库之间最明显的区别因素。我们将看到不同的模型如何适用于不同的情况。

  4. Chapter 2 compares several different data models and query languages—the most visible distinguishing factor between databases from a developer’s point of view. We will see how different models are appropriate to different situations.

  5. 第 3 章介绍存储引擎的内部结构,并研究数据库如何在磁盘上布置数据。不同的存储引擎针对不同的工作负载进行了优化,选择正确的存储引擎会对性能产生巨大影响。

  6. Chapter 3 turns to the internals of storage engines and looks at how databases lay out data on disk. Different storage engines are optimized for different workloads, and choosing the right one can have a huge effect on performance.

  7. 第 4 章比较了数据编码(序列化)的各种格式,并特别研究了它们在应用程序需求变化和模式需要随时间适应的环境中的表现。

  8. Chapter 4 compares various formats for data encoding (serialization) and especially examines how they fare in an environment where application requirements change and schemas need to adapt over time.

稍后,第二部分将转向分布式数据系统的特定问题。

Later, Part II will turn to the particular issues of distributed data systems.

第 1 章可靠、可扩展且可维护的应用程序

Chapter 1. Reliable, Scalable, and Maintainable Applications

互联网做得非常好,以至于大多数人都将其视为像太平洋一样的自然资源,而不是人造的东西。上一次如此规模的技术如此无差错是什么时候?

艾伦·凯(Alan Kay),《Dr. Dobb's Journal》访谈(2012)

The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?

Alan Kay, in interview with Dr Dobb’s Journal (2012)

当今的许多应用程序都是数据密集型的,而不是计算密集型的。原始 CPU 处理能力很少成为这些应用程序的限制因素,更大的问题通常是数据量、数据的复杂性以及数据变化的速度。

Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.

数据密集型应用程序通常由提供常用功能的标准构建块构建。例如,许多应用程序需要:

A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

  • 存储数据,以便他们或其他应用程序稍后可以再次找到它(数据库

  • Store data so that they, or another application, can find it again later (databases)

  • 记住昂贵操作的结果,以加快读取速度(缓存

  • Remember the result of an expensive operation, to speed up reads (caches)

  • 允许用户按关键字搜索数据或以各种方式过滤数据(搜索索引

  • Allow users to search data by keyword or filter it in various ways (search indexes)

  • 发送消息到另一个进程,以异步方式处理(流处理

  • Send a message to another process, to be handled asynchronously (stream processing)

  • 定期处理大量积累的数据(批处理

  • Periodically crunch a large amount of accumulated data (batch processing)
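上面列出的每一类构建块,都可以用几行代码粗略示意其职责。下面是一个极简的 Python 草图,用内存数据结构(均为假设的玩具实现,并非真实系统)代替数据库、缓存、搜索索引和消息队列:

A minimal Python sketch of these building blocks, using in-memory stand-ins (hypothetical toy implementations, not real systems) in place of an actual database, cache, search index, and message queue:

```python
# 玩具实现,仅作示意 / toy stand-ins for illustration only
database = {}      # 数据库:存储数据以便稍后再次找到 / durable storage
cache = {}         # 缓存:记住昂贵操作的结果 / speed up reads
search_index = {}  # 搜索索引:按关键字查找 / keyword lookup
message_queue = [] # 消息队列:异步处理 / asynchronous handling

def store_user(user_id, name, bio):
    """Write path: update the database, then the derived systems."""
    database[user_id] = {"name": name, "bio": bio}
    cache.pop(user_id, None)                  # invalidate any stale cache entry
    for word in bio.split():                  # index each keyword in the bio
        search_index.setdefault(word.lower(), set()).add(user_id)
    message_queue.append(("user_updated", user_id))  # notify async consumers

def read_user(user_id):
    """Read path: try the cache first, fall back to the database."""
    if user_id not in cache:
        cache[user_id] = database[user_id]    # cache miss: load and fill
    return cache[user_id]

def batch_count_users():
    """Batch job: periodically crunch the accumulated data."""
    return len(database)
```

真实系统中,这些组件各自是独立的网络服务;这里只是为了说明各自的角色。In a real system each of these would be a separate networked service; this sketch only illustrates their respective roles.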

如果这听起来非常明显,那只是因为这些数据系统是一个非常成功的抽象:我们一直在使用它们,而没有考虑太多。在构建应用程序时,大多数工程师不会梦想从头开始编写新的数据存储引擎,因为数据库是完成这项工作的完美工具。

If that sounds painfully obvious, that’s just because these data systems are such a successful abstraction: we use them all the time without thinking too much. When building an application, most engineers wouldn’t dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.

但现实并非那么简单。有许多数据库系统具有不同的特性,因为不同的应用程序有不同的要求。有多种缓存方法、多种构建搜索索引的方法等等。在构建应用程序时,我们仍然需要弄清楚哪些工具和方法最适合手头的任务。当您需要完成单个工具无法单独完成的事情时,组合使用工具可能会很困难。

But reality is not that simple. There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone.

本书讲述了数据系统的原理和实用性,以及如何使用它们构建数据密集型应用程序。我们将探讨不同工具的共同点、区别以及它们如何实现各自的特性。

This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics.

在本章中,我们将首先探讨我们想要实现的目标的基础知识:可靠、可扩展和可维护的数据系统。我们将澄清这些事情的含义,概述一些思考它们的方法,并回顾后面章节所需的基础知识。在接下来的章节中,我们将继续逐层讨论在处理数据密集型应用程序时需要考虑的不同设计决策。

In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. We’ll clarify what those things mean, outline some ways of thinking about them, and go over the basics that we will need for later chapters. In the following chapters we will continue layer by layer, looking at different design decisions that need to be considered when working on a data-intensive application.

关于数据系统的思考

Thinking About Data Systems

我们通常认为数据库、队列、缓存等是非常不同类别的工具。尽管数据库和消息队列有一些表面上的相似性(两者都存储数据一段时间),但它们具有非常不同的访问模式,这意味着不同的性能特征,因此实现也非常不同。

We typically think of databases, queues, caches, etc. as being very different categories of tools. Although a database and a message queue have some superficial similarity—both store data for some time—they have very different access patterns, which means different performance characteristics, and thus very different implementations.

那么为什么我们要把它们放在一个总称下,比如数据系统呢?

So why should we lump them all together under an umbrella term like data systems?

近年来出现了许多用于数据存储和处理的新工具。它们针对各种不同的用例进行了优化,不再完全适合传统类别 [1]。例如,有些数据存储也被用作消息队列(Redis),而有些消息队列则具有类似数据库的持久性保证(Apache Kafka)。类别之间的界限正在变得模糊。

Many new tools for data storage and processing have emerged in recent years. They are optimized for a variety of different use cases, and they no longer neatly fit into traditional categories [1]. For example, there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.

其次,现在越来越多的应用程序具有如此苛刻或广泛的要求,以至于单一工具不再能够满足其所有数据处理和存储需求。相反,工作被分解为可以在单个工具上高效执行的任务,并且使用应用程序代码将这些不同的工具缝合在一起。

Secondly, increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of its data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently on a single tool, and those different tools are stitched together using application code.

例如,如果您有一个由应用程序管理的缓存层(使用 Memcached 或类似工具),或者一个与主数据库分离的全文搜索服务器(例如 Elasticsearch 或 Solr),通常就由应用程序代码负责保持这些缓存和索引与主数据库同步。图 1-1 展示了它的大致样子(我们将在后面的章节中详细介绍)。

For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code’s responsibility to keep those caches and indexes in sync with the main database. Figure 1-1 gives a glimpse of what this may look like (we will go into detail in later chapters).

图 1-1。一种可能的数据系统架构,由多个组件组合而成。

Figure 1-1. One possible architecture for a data system that combines several components.

当您组合多个工具来提供服务时,服务的接口或应用程序编程接口 (API) 通常会对客户端隐藏这些实现细节。现在,您基本上已经从较小的通用组件创建了一个新的专用数据系统。您的复合数据系统可能会提供某些保证:例如,缓存将在写入时正确失效或更新,以便外部客户端看到一致的结果。您现在不仅是一名应用程序开发人员,而且还是一名数据系统设计师。

When you combine several tools in order to provide a service, the service’s interface or application programming interface (API) usually hides those implementation details from clients. Now you have essentially created a new, special-purpose data system from smaller, general-purpose components. Your composite data system may provide certain guarantees: e.g., that the cache will be correctly invalidated or updated on writes so that outside clients see consistent results. You are now not only an application developer, but also a data system designer.

如果您正在设计数据系统或服务,就会出现很多棘手的问题。即使内部出现问题,您如何确保数据保持正确和完整?即使系统的某些部分性能下降,您如何为客户提供始终如一的良好性能?如何扩展以应对负载的增加?一个好的服务 API 是什么样的?

If you are designing a data system or service, a lot of tricky questions arise. How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?

有许多因素可能会影响数据系统的设计,包括相关人员的技能和经验、遗留系统依赖、交付时间表、组织对不同类型风险的容忍度、监管限制等。这些因素在很大程度上取决于具体情况。

There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization’s tolerance of different kinds of risk, regulatory constraints, etc. Those factors depend very much on the situation.

在本书中,我们重点关注大多数软件系统中重要的三个问题:

In this book, we focus on three concerns that are important in most software systems:

可靠性
Reliability

即使面对逆境(硬件或软件故障,甚至人为错误), 系统也应继续正确工作(以所需的性能水平执行正确的功能)。参见“可靠性”

The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). See “Reliability”.

可扩展性
Scalability

随着系统的增长(数据量、流量或复杂性),应该有合理的方法来应对这种增长。请参阅“可扩展性”

As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth. See “Scalability”.

可维护性
Maintainability

随着时间的推移,许多不同的人将参与系统的工作(工程和运维人员,既要维护当前的行为,又要使系统适应新的用例),他们都应该能够高效地工作。请参阅“可维护性”。

Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively. See “Maintainability”.

这些词经常被随意使用,而人们并未清楚理解其含义。为了进行深思熟虑的工程设计,我们将在本章的其余部分探索思考可靠性、可扩展性和可维护性的方法。然后,在接下来的章节中,我们将介绍为实现这些目标而使用的各种技术、架构和算法。

These words are often cast around without a clear understanding of what they mean. In the interest of thoughtful engineering, we will spend the rest of this chapter exploring ways of thinking about reliability, scalability, and maintainability. Then, in the following chapters, we will look at various techniques, architectures, and algorithms that are used in order to achieve those goals.

可靠性

Reliability

每个人对于某事物可靠或不可靠意味着什么都有一个直观的想法。对于软件,典型的期望包括:

Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For software, typical expectations include:

  • 应用程序执行用户期望的功能。

  • The application performs the function that the user expected.

  • 它可以容忍用户犯错误或以意想不到的方式使用软件。

  • It can tolerate the user making mistakes or using the software in unexpected ways.

  • 在预期的负载和数据量下,其性能足以满足所需的用例。

  • Its performance is good enough for the required use case, under the expected load and data volume.

  • 该系统可防止任何未经授权的访问和滥用。

  • The system prevents any unauthorized access and abuse.

如果所有这些东西加在一起意味着“正常工作”,那么我们可以将可靠性大致理解为“即使出现问题,也能继续正常工作”。

If all those things together mean “working correctly,” then we can understand reliability as meaning, roughly, “continuing to work correctly, even when things go wrong.”

可能出错的事情称为故障(fault),而能够预见故障并加以应对的系统称为容错(fault-tolerant)或弹性(resilient)系统。前一个术语有点误导:它暗示我们可以让系统容忍所有可能的故障,但这在现实中并不可行。如果整个地球(以及上面的所有服务器)都被黑洞吞噬,那么要容忍这种故障就需要把网络托管到太空中——祝你好运,能让这个预算项目获得批准。因此,只有谈论容忍某些类型的故障才有意义。

The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. The former term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible. If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space—good luck getting that budget item approved. So it only makes sense to talk about tolerating certain types of faults.

请注意,故障(fault)与失效(failure)不同[ 2 ]。故障通常被定义为系统的某个组件偏离其规格,而失效是指整个系统停止向用户提供所需的服务。不可能将故障概率降低到零;因此,通常最好设计容错机制来防止故障引起失效。在本书中,我们将介绍几种用不可靠的部件构建可靠系统的技术。

Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. In this book we cover several techniques for building reliable systems from unreliable parts.

与直觉相反的是,在这种容错系统中,通过故意触发故障来增加故障率 是有意义的——例如,在没有警告的情况下随机终止单个进程。许多严重错误实际上是由于错误处理不当造成的 [ 3 ];通过故意引发故障,您可以确保容错机制不断得到运用和测试,这可以增加您对自然发生的故障得到正确处理的信心。Netflix Chaos Monkey [ 4 ] 就是这种方法的一个例子。

Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately—for example, by randomly killing individual processes without warning. Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. The Netflix Chaos Monkey [4] is an example of this approach.

尽管我们通常更喜欢容忍错误而不是预防错误,但在某些情况下,预防胜于治疗(例如,因为不存在治疗方法)。安全问题就是这种情况,例如:如果攻击者破坏了系统并获得了对敏感数据的访问权限,则该事件无法撤消。然而,本书主要讨论的是可以修复的故障类型,如以下各节所述。

Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better than cure (e.g., because no cure exists). This is the case with security matters, for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone. However, this book mostly deals with the kinds of faults that can be cured, as described in the following sections.

硬件故障

Hardware Faults

当我们想到系统故障的原因时,我们很快就会想到硬件故障。硬盘崩溃、内存故障、电网停电、有人拔错网线。任何使用过大型数据中心的人都可以告诉您,当您拥有大量机器时,这些事情总是会发生。

When we think of causes of system failure, hardware faults quickly come to mind. Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable. Anyone who has worked with large datacenters can tell you that these things happen all the time when you have a lot of machines.

据报道,硬盘的平均无故障时间 (MTTF) 约为 10 至 50 年 [ 5 , 6 ]。因此,在拥有 10,000 个磁盘的存储集群上,我们预计平均每天会有一个磁盘死亡。

Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years [5, 6]. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
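上述估算背后的算术可以直接验证(这只是一个示意性的计算,假设各磁盘的故障相互独立、以恒定速率 1/MTTF 发生;27.4 年这个取值只是区间中的一个示例,并非来自任何厂商的数据):

The arithmetic behind that estimate can be checked directly (a sketch assuming independent failures at a constant rate of 1/MTTF per disk; the 27.4-year figure is an illustrative midpoint, not a vendor datasheet value):

```python
def expected_failures_per_day(num_disks, mttf_years):
    # With independent failures at rate 1/MTTF per disk, the expected
    # number of failures per day is simply num_disks / MTTF-in-days.
    mttf_days = mttf_years * 365
    return num_disks / mttf_days

# 10,000 disks with an MTTF of ~27.4 years (about 10,000 days):
# we expect roughly one disk to die per day.
rate = expected_failures_per_day(10_000, 27.4)
```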

我们的第一反应通常是为各个硬件组件添加冗余,以降低系统的故障率。磁盘可以设置为 RAID 配置,服务器可以具有双电源和可热插拔 CPU,数据中心可以具有电池和柴油发电机作为备用电源。当一个组件失效时,多余的组件可以取代它,同时更换损坏的组件。这种方法不能完全防止硬件问题导致故障,但它很好理解,并且通常可以使机器不间断地运行数年。

Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.

直到最近,硬件组件的冗余对于大多数应用程序来说已经足够了,因为它使得单台机器完全故障的情况相当罕见。只要您可以相当快地将备份恢复到新计算机上,在大多数应用程序中,发生故障时的停机时间就不会是灾难性的。因此,只有少数应用程序需要多机冗余,而高可用性对于这些应用程序来说是绝对必要的。

Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare. As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic in most applications. Thus, multi-machine redundancy was only required by a small number of applications for which high availability was absolutely essential.

然而,随着数据量和应用程序计算需求的增加,更多的应用程序开始使用更多数量的机器,这相应地增加了硬件故障率。此外,在 Amazon Web Services (AWS) 等一些云平台中,虚拟机实例在没有警告的情况下变得不可用是相当常见的[ 7 ],因为这些平台的设计优先考虑灵活性和弹性,而不是单机可靠性。

However, as data volumes and applications’ computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as Amazon Web Services (AWS) it is fairly common for virtual machine instances to become unavailable without warning [7], as the platforms are designed to prioritize flexibility and elasticityi over single-machine reliability.

因此,通过优先使用软件容错技术或在硬件冗余之外使用软件容错技术,出现了一种可以容忍整个机器丢失的系统的趋势。此类系统还具有操作优势:如果需要重新启动计算机(例如,应用操作系统安全补丁),单服务器系统需要计划停机时间,而可以容忍机器故障的系统可以一次对一个节点进行修补,整个系统不会停机(滚动升级;请参阅第 4 章)。

Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade; see Chapter 4).

软件错误

Software Errors

我们通常认为硬件故障是随机的且相互独立的:一台机器的磁盘发生故障并不意味着另一台机器的磁盘也会发生故障。可能存在弱相关性(例如由于共同原因,例如服务器机架中的温度),但否则大量硬件组件不太可能同时发生故障。

We usually think of hardware faults as being random and independent from each other: one machine’s disk failing does not imply that another machine’s disk is going to fail. There may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.

另一类故障是系统内的系统性错误[ 8 ]。此类故障更难预测,并且由于它们在节点之间相互关联,因此往往比不相关的硬件故障导致更多的系统失效[ 5 ]。示例包括:

Another class of fault is a systematic error within the system [8]. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults [5]. Examples include:

  • 一种软件错误,当输入特定的错误输入时,该错误会导致应用程序服务器的每个实例崩溃。例如,考虑 2012 年 6 月 30 日的闰秒,由于 Linux 内核中的错误,导致许多应用程序同时挂起 [ 9 ]。

  • A software bug that causes every instance of an application server to crash when given a particular bad input. For example, consider the leap second on June 30, 2012, that caused many applications to hang simultaneously due to a bug in the Linux kernel [9].

  • 耗尽某些共享资源(CPU 时间、内存、磁盘空间或网络带宽)的失控进程。

  • A runaway process that uses up some shared resource—CPU time, memory, disk space, or network bandwidth.

  • 系统所依赖的服务速度减慢、变得无响应或开始返回损坏的响应。

  • A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.

  • 级联故障,其中一个组件中的小故障触发另一个组件中的故障,进而触发更多故障[ 10 ]。

  • Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults [10].

导致此类软件故障的错误通常会潜伏很长一段时间,直到由一组不寻常的情况触发。在这些情况下,我们发现软件正在对其环境做出某种假设——虽然这种假设通常是正确的,但由于某种原因它最终不再正确[11 ]

The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment—and while that assumption is usually true, it eventually stops being true for some reason [11].

软件中的系统故障问题没有快速的解决方案。许多小事情都可以提供帮助:仔细思考系统中的假设和交互;彻底的测试;进程隔离;允许进程崩溃并重新启动;测量、监控和分析生产中的系统行为。如果系统期望提供某种保证(例如,在消息队列中,传入消息的数量等于传出消息的数量),则它可以在运行时不断检查自身,并在发现差异时发出警报[ 12 ]。

There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found [12].
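文中提到的运行时自检思想可以用一对计数器来勾勒(这是一个假想的进程内队列示意,并非任何真实消息代理的 API):

The runtime self-checking idea can be sketched with a pair of counters (a hypothetical in-process queue for illustration, not any real message broker's API):

```python
class AuditedQueue:
    """A toy queue that audits itself: the number of incoming messages
    must always equal the outgoing messages plus those still buffered."""

    def __init__(self):
        self.buffer = []
        self.incoming = 0
        self.outgoing = 0

    def send(self, message):
        self.buffer.append(message)
        self.incoming += 1

    def receive(self):
        message = self.buffer.pop(0)
        self.outgoing += 1
        return message

    def check_invariant(self):
        # Raise an alert (here, an exception) if messages were lost
        # or duplicated somewhere along the way.
        if self.incoming != self.outgoing + len(self.buffer):
            raise RuntimeError("message count mismatch: possible data loss")
```

在生产系统中,这样的检查会周期性地运行并接入监控告警,而不是抛出异常。In a production system such a check would run periodically and feed a monitoring alert rather than raising an exception.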

人为错误

Human Errors

人类设计和构建软件系统,维持系统运行的操作人员也是人类。即使他们有最好的意图,人类也被认为是不可靠的。例如,一项针对大型互联网服务的研究发现,运营商的配置错误是中断的主要原因,而硬件故障(服务器或网络)仅在 10-25% 的中断中发挥了作用 [13 ]

Humans design and build software systems, and the operators who keep the systems running are also human. Even when they have the best intentions, humans are known to be unreliable. For example, one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [13].

尽管人类不可靠,我们如何使我们的系统可靠?最好的系统结合了多种方法:

How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:

  • 以尽量减少出错机会的方式设计系统。例如,精心设计的抽象、API 和管理界面可以轻松地做“正确的事情”并阻止“错误的事情”。然而,如果接口限制太多,人们就会绕过它们,从而抵消它们的好处,所以这是一个很难实现的平衡。

  • Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.

  • 将人们最容易犯错误的地方与可能导致失败的地方分开。特别是,提供功能齐全的非生产沙箱环境,人们可以在其中使用真实数据安全地探索和实验,而不会影响真实用户。

  • Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.

  • 从单元测试到全系统集成测试和手动测试,在各个级别上进行彻底的测试[ 3 ]。自动化测试应用广泛,易于理解,对于覆盖正常操作中很少出现的极端情况尤其有价值。

  • Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests [3]. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.

  • 允许从人为错误中快速轻松地恢复,以尽量减少发生失效时的影响。例如,让配置更改可以快速回滚,逐步推出新代码(以便任何意外的错误只影响一小部分用户),并提供重新计算数据的工具(以防事后发现旧的计算不正确)。

  • Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).

  • 设置详细且清晰的监控,例如性能指标和错误率。在其他工程学科中,这称为遥测。(一旦火箭离开地面,遥测对于跟踪正在发生的情况和了解故障至关重要[ 14 ]。)监控可以向我们显示早期预警信号,并允许我们检查是否违反了任何假设或约束。当问题发生时,指标对于诊断问题非常有价值。

  • Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. (Once a rocket has left the ground, telemetry is essential for tracking what is happening, and for understanding failures [14].) Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.

  • 实施良好的管理实践和培训——一个复杂而重要的方面,超出了本书的范围。

  • Implement good management practices and training—a complex and important aspect, and beyond the scope of this book.

可靠性有多重要?

How Important Is Reliability?

可靠性不仅适用于核电站和空中交通管制软件,更常见的应用也有望可靠运行。业务应用程序中的错误会导致生产力下降(如果报告的数据不正确,还会带来法律风险),而电子商务网站的中断可能会造成巨大的收入损失和声誉损害。

Reliability is not just for nuclear power stations and air traffic control software—more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of ecommerce sites can have huge costs in terms of lost revenue and damage to reputation.

即使在“非关键”应用程序中,我们也对用户负责。考虑一位家长将孩子的所有照片和视频存储在您的照片应用程序中 [ 15 ]。如果数据库突然损坏,他们会有什么感觉?他们知道如何从备份中恢复它吗?

Even in “noncritical” applications we have a responsibility to our users. Consider a parent who stores all their pictures and videos of their children in your photo application [15]. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup?

在某些情况下,我们可能会选择牺牲可靠性来降低开发成本(例如,为未经证实的市场开发原型产品时)或运营成本(例如,对于利润率非常低的服务),但我们应该非常清楚地意识到自己何时在走捷径。

There are situations in which we may choose to sacrifice reliability in order to reduce development cost (e.g., when developing a prototype product for an unproven market) or operational cost (e.g., for a service with a very narrow profit margin)—but we should be very conscious of when we are cutting corners.

可扩展性

Scalability

即使系统今天可靠地工作,并不意味着它在未来也一定能可靠地工作。性能下降的一个常见原因是负载增加:也许系统已从 10,000 个并发用户增长到 100,000 个并发用户,或者从 100 万增长到 1000 万。也许它正在处理的数据量比以前大得多。

Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before.

可扩展性是我们用来描述系统应对增加的负载的能力的术语。但请注意,它不是我们可以附加到系统上的一维标签:说“X 可扩展”或“Y 不可扩展”是没有意义的。相反,讨论可扩展性意味着考虑诸如“如果系统以特定方式增长,我们应对增长的选择是什么?”之类的问题。以及“我们如何添加计算资源来处理额外的负载?”

Scalability is the term we use to describe a system’s ability to cope with increased load. Note, however, that it is not a one-dimensional label that we can attach to a system: it is meaningless to say “X is scalable” or “Y doesn’t scale.” Rather, discussing scalability means considering questions like “If the system grows in a particular way, what are our options for coping with the growth?” and “How can we add computing resources to handle the additional load?”

描述负载

Describing Load

首先,我们需要简洁地描述系统当前的负载;只有这样我们才能讨论增长问题(如果负载翻倍会发生什么?)。负载可以用一些数字来描述,我们称之为负载参数。参数的最佳选择取决于系统的架构:可能是 Web 服务器每秒的请求数、数据库中读写操作的比率、聊天室中同时活跃用户的数量、缓存命中率,或者其他东西。也许平均情况对您来说很重要,也许您的瓶颈由少数极端情况决定。

First, we need to succinctly describe the current load on the system; only then can we discuss growth questions (what happens if our load doubles?). Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else. Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.

为了使这个想法更加具体,我们以 Twitter 为例,使用 2012 年 11 月发布的数据[ 16 ]。Twitter 的两项主要业务是:

To make this idea more concrete, let’s consider Twitter as an example, using data published in November 2012 [16]. Two of Twitter’s main operations are:

发布推文
Post tweet

用户可以向其关注者发布新消息(平均 4.6k 请求/秒,峰值超过 12k 请求/秒)。

A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).

主页时间表
Home timeline

用户可以查看他们关注的人发布的推文(300k 请求/秒)。

A user can view tweets posted by the people they follow (300k requests/sec).

只需处理每秒 12,000 次写入(发布推文的峰值速率)就相当容易了。然而,Twitter 的扩展挑战主要不是由于推文数量,而是由于 扇出ii — 每个用户关注很多人,每个用户又被很多人关注。实现这两个操作大致有两种方法:

Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy. However, Twitter’s scaling challenge is not primarily due to tweet volume, but due to fan-outii—each user follows many people, and each user is followed by many people. There are broadly two ways of implementing these two operations:

  1. 发布推文只需将新推文插入到全局推文集合中。当用户请求他们的主页时间线时,查找他们关注的所有人,找到这些人各自的所有推文,然后将它们合并(按时间排序)。在如图 1-2所示的关系数据库中,您可以编写如下查询:

    SELECT tweets.*, users.* FROM tweets
      JOIN users   ON tweets.sender_id    = users.id
      JOIN follows ON follows.followee_id = users.id
      WHERE follows.follower_id = current_user
  1. Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database like in Figure 1-2, you could write a query such as:

    SELECT tweets.*, users.* FROM tweets
      JOIN users   ON tweets.sender_id    = users.id
      JOIN follows ON follows.followee_id = users.id
      WHERE follows.follower_id = current_user
  2. 为每个用户的主页时间线维护一个缓存,就像为每个接收推文的用户准备一个邮箱(参见图 1-3)。当用户发布推文时,查找关注该用户的所有人,并将新推文插入到他们每个人的主页时间线缓存中。这样,读取主页时间线的请求就很便宜,因为其结果已经提前计算好了。

  2. Maintain a cache for each user’s home timeline—like a mailbox of tweets for each recipient user (see Figure 1-3). When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.

图 1-2。用于实现 Twitter 主页时间线的简单关系模式。

Figure 1-2. Simple relational schema for implementing a Twitter home timeline.

图 1-3。Twitter 用于向关注者发送推文的数据管道,负载参数截至 2012 年 11 月 [ 16 ]。

Figure 1-3. Twitter’s data pipeline for delivering tweets to followers, with load parameters as of November 2012 [16].
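方法 2 可以用内存中的字典来勾勒,用它们代替时间线缓存(这是一个玩具模型;真实系统会把这些结构分片到许多机器上,这里的函数名也纯属示意):

Approach 2 can be sketched with in-memory dictionaries standing in for the timeline caches (a toy model; a real system shards these structures across many machines, and the function names here are purely illustrative):

```python
from collections import defaultdict

followers = defaultdict(set)   # user -> the set of users who follow them
timelines = defaultdict(list)  # user -> cached home timeline (oldest first)

def follow(follower, followee):
    followers[followee].add(follower)

def post_tweet(author, text):
    # Fan-out on write: push the new tweet into every follower's cache.
    for follower in followers[author]:
        timelines[follower].append((author, text))

def home_timeline(user):
    # Reading is cheap: the result was precomputed at write time.
    return timelines[user]
```

一个拥有 3000 万关注者的名人会让 post_tweet 执行 3000 万次缓存写入,这正是正文稍后描述的混合方案存在的原因。A celebrity with 30 million followers would make post_tweet perform 30 million cache writes, which is exactly why the hybrid scheme described later in the text exists.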

Twitter 的第一个版本使用方法 1,但系统难以跟上主页时间线查询的负载,因此该公司改用了方法 2。方法 2 效果更好,因为发布推文的平均速率比主页时间线读取速率低了几乎两个数量级,因此在这种情况下,最好在写入时多做些工作,而在读取时少做些。

The first version of Twitter used approach 1, but the systems struggled to keep up with the load of home timeline queries, so the company switched to approach 2. This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, and so in this case it’s preferable to do more work at write time and less at read time.

然而,方法 2 的缺点是现在发布推文需要大量额外的工作。平均而言,一条推文会发送给大约 75 个关注者,因此每秒 4.6k 条推文变成了每秒 345k 次对主页时间线缓存的写入。但这个平均值掩盖了一个事实:每个用户的关注者数量差异很大,有些用户拥有超过 3000 万关注者。这意味着一条推文可能导致超过 3000 万次对主页时间线的写入!及时完成这项工作(Twitter 试图在五秒内将推文发送给关注者)是一项重大挑战。

However, the downside of approach 2 is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches. But this average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers. This means that a single tweet may result in over 30 million writes to home timelines! Doing this in a timely manner—Twitter tries to deliver tweets to followers within five seconds—is a significant challenge.
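这段话中的写放大可以直接算出来(数字取自正文;平均值掩盖的正是单个名人造成的峰值):

The write amplification in that paragraph can be checked directly (figures from the text; the per-celebrity spike is exactly what the average hides):

```python
tweets_per_sec = 4_600        # average rate of posted tweets
avg_followers = 75            # average audience per tweet

# Fan-out on write: every tweet becomes one write per follower,
# so 4.6k tweets/sec turn into 345k timeline-cache writes/sec.
timeline_writes_per_sec = tweets_per_sec * avg_followers

# Worst case for a single tweet, by follower count alone:
celebrity_fanout = 30_000_000
```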

在 Twitter 的示例中,每个用户的关注者分布(可能根据这些用户发推文的频率进行加权)是讨论可扩展性的关键负载参数,因为它决定了扇出负载。您的应用程序可能具有非常不同的特征,但您可以应用类似的原则来推理其负载。

In the example of Twitter, the distribution of followers per user (maybe weighted by how often those users tweet) is a key load parameter for discussing scalability, since it determines the fan-out load. Your application may have very different characteristics, but you can apply similar principles to reasoning about its load.

Twitter 轶事的最后一个转折是:如今方法 2 已经稳健地实现,Twitter 正在转向两种方法的混合。大多数用户的推文在发布时继续被扇出到各个主页时间线,但少数拥有大量关注者的用户(即名人)被排除在这种扇出之外。用户所关注的任何名人的推文会被单独获取,并在读取时与该用户的主页时间线合并,就像方法 1 一样。这种混合方法能够提供始终如一的良好性能。在介绍了更多技术基础之后,我们将在第 12 章重新讨论这个例子。

The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user may follow are fetched separately and merged with that user’s home timeline when it is read, like in approach 1. This hybrid approach is able to deliver consistently good performance. We will revisit this example in Chapter 12 after we have covered some more technical ground.

描述性能

Describing Performance

一旦描述了系统上的负载,您就可以调查负载增加时会发生什么情况。你可以从两个方面来看待它:

Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:

  • 当您增加负载参数并保持系统资源(CPU、内存、网络带宽等)不变时,系统的性能会受到怎样的影响?

  • When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?

  • 当增加一个负载参数时,如果要保持性能不变,需要增加多少资源?

  • When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

这两个问题都需要性能数据,所以让我们简要地看一下系统性能的描述。

Both questions require performance numbers, so let’s look briefly at describing the performance of a system.

在 Hadoop 这样的批处理系统中,我们通常关心吞吐量——每秒可以处理的记录数,或者在一定大小的数据集上运行作业所需的总时间。iii在在线系统中,通常更重要的是服务的 响应时间,即客户端发送请求和接收响应之间的时间。

In a batch processing system such as Hadoop, we usually care about throughput—the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.iii In online systems, what’s usually more important is the service’s response time—that is, the time between a client sending a request and receiving a response.

延迟和响应时间

Latency and response time

延迟(latency)和响应时间(response time)经常被当作同义词使用,但它们并不相同。响应时间是客户端所看到的:除了处理请求的实际时间(服务时间)之外,还包括网络延迟和排队延迟。延迟则是请求等待处理的持续时间——在此期间它处于潜伏(latent)状态,等待服务[ 17 ]。

Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service [17].

即使您只是一遍又一遍地发出相同的请求,每次尝试都会得到略有不同的响应时间。实际上,在处理各种请求的系统中,响应时间可能会有很大差异。因此,我们需要将响应时间视为 可测量的值的分布,而不是单个数字。

Even if you only make the same request over and over again, you’ll get a slightly different response time on every try. In practice, in a system handling a variety of requests, the response time can vary a lot. We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.

图 1-4中,每个灰色条代表对服务的请求,其高度显示该请求花费的时间。大多数请求都相当快,但偶尔也会有异常情况需要更长的时间。也许缓慢的请求本质上更昂贵,例如,因为它们处理更多的数据。但即使在您认为所有请求应该花费相同时间的情况下,您也会遇到变化:上下文切换到后台进程、网络数据包丢失和 TCP 重传、垃圾收集可能会引入随机的额外延迟暂停、强制从磁盘读取的页错误、服务器机架中的机械振动 [ 18 ] 或许多其他原因。

In Figure 1-4, each gray bar represents a request to a service, and its height shows how long that request took. Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps the slow requests are intrinsically more expensive, e.g., because they process more data. But even in a scenario where you’d think all requests should take the same time, you get variation: random additional latency could be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [18], or many other causes.

图 1-4。说明平均值和百分位数:对某服务的 100 个请求样本的响应时间。

Figure 1-4. Illustrating mean and percentiles: response times for a sample of 100 requests to a service.

报告服务的平均响应时间是很常见的。(严格来说,术语“平均”并不指任何特定的公式,但在实践中,它通常被理解为算术平均值:给定 n 个值,将所有值相加,然后除以 n。)但是,如果您想知道“典型”的响应时间,平均值并不是一个很好的指标,因为它无法告诉您有多少用户实际经历了那样的延迟。

It’s common to see the average response time of a service reported. (Strictly speaking, the term “average” doesn’t refer to any particular formula, but in practice it is usually understood as the arithmetic mean: given n values, add up all the values, and divide by n.) However, the mean is not a very good metric if you want to know your “typical” response time, because it doesn’t tell you how many users actually experienced that delay.

通常最好使用百分位数。如果您获取响应时间列表并将其从最快到最慢排序,那么中位数就是中间点:例如,如果您的中位数响应时间是 200 毫秒,这意味着一半的请求在不到 200 毫秒的时间内返回,一半的请求在 200 毫秒内返回。您的请求需要比这更长的时间。

Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that.

如果您想知道用户通常需要等待多长时间,这使得中位数成为一个很好的指标:一半的用户请求在不到中位数响应时间的时间内得到满足,而另一半则需要比中位数更长的时间。中位数也称为第 50 个百分位数,有时缩写为p50。注意,中位数指的是单个请求;如果用户发出多个请求(在一个会话过程中,或者因为单个页面中包含多个资源),则至少其中一个请求比中值慢的概率远大于 50%。

This makes the median a good metric if you want to know how long users typically have to wait: half of user requests are served in less than the median response time, and the other half take longer than the median. The median is also known as the 50th percentile, and sometimes abbreviated as p50. Note that the median refers to a single request; if the user makes several requests (over the course of a session, or because several resources are included in a single page), the probability that at least one of them is slower than the median is much greater than 50%.
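最后一句中的概率很容易算出来(假设各请求的响应时间相互独立,真实会话只是近似满足这一点):

The probability in that last sentence is easy to work out (assuming the requests' response times are independent, which real sessions only approximate):

```python
def prob_at_least_one_slower_than_median(num_requests):
    # Each request independently has probability 1/2 of being slower
    # than the median, so the complement of "all faster" is:
    return 1 - 0.5 ** num_requests

# A page that loads 10 resources almost certainly includes at least
# one request slower than the median (probability ~99.9%).
p = prob_at_least_one_slower_than_median(10)
```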

为了弄清楚异常值有多严重,您可以查看较高的百分位数:第95、99 99.9百分位数很常见(缩写为p95p99p999)。它们是响应时间阈值,在该阈值下,95%、99% 或 99.9% 的请求比该特定阈值快。例如,如果第 95 个百分位响应时间为 1.5 秒,则意味着 100 个请求中有 95 个花费的时间少于 1.5 秒,而 100 个请求中有 5 个花费的时间为 1.5 秒或更长。如图 1-4所示。

In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th, and 99.9th percentiles are common (abbreviated p95, p99, and p999). They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is illustrated in Figure 1-4.
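对一组测得的响应时间,可以通过排序来计算这些百分位(这是一个简单的“最近秩”示意;生产环境的监控系统通常改用流式近似算法):

Percentiles over a set of measured response times can be computed by sorting (a simple nearest-rank sketch; production monitoring systems typically use streaming approximations instead):

```python
import math

def percentile(response_times, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of the measurements are less than or equal to it."""
    ordered = sorted(response_times)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

times_ms = list(range(1, 101))       # a toy sample: 1 ms .. 100 ms
median = percentile(times_ms, 50)    # p50
tail = percentile(times_ms, 99)      # p99
```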

高百分比的响应时间(也称为尾部延迟)非常重要,因为它们直接影响用户的服务体验。例如,亚马逊用 99.9% 的百分位来描述内部服务的响应时间要求,尽管它只影响千分之一的请求。这是因为请求最慢的客户通常是那些帐户上拥有最多数据的客户,因为他们进行了多次购买,也就是说,他们是最有价值的客户 [19 ]。通过确保网站对他们来说速度快来让这些客户满意非常重要:亚马逊还观察到,响应时间增加 100 毫秒会使销售额减少 1% [ 20],其他人报告说,1 秒的减速会使客户满意度指标降低 16% [ 21 , 22 ]。

High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service. For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests. This is because the customers with the slowest requests are often those who have the most data on their accounts because they have made many purchases—that is, they’re the most valuable customers [19]. It’s important to keep those customers happy by ensuring the website is fast for them: Amazon has also observed that a 100 ms increase in response time reduces sales by 1% [20], and others report that a 1-second slowdown reduces a customer satisfaction metric by 16% [21, 22].

另一方面,优化第 99.99 百分位(最慢的万分之一请求)被认为过于昂贵,并且不能为亚马逊的目的带来足够的好处。降低非常高的百分位上的响应时间很困难,因为它们很容易受到您无法控制的随机事件的影响,而且收益会递减。

On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and to not yield enough benefit for Amazon’s purposes. Reducing response times at very high percentiles is difficult because they are easily affected by random events outside of your control, and the benefits are diminishing.

例如,百分位数通常用于服务级别目标(SLO)和服务级别协议(SLA)中,它们是定义服务预期性能和可用性的合同。SLA 可能规定:如果服务的中值响应时间小于 200 毫秒且第 99 百分位低于 1 秒,则认为服务正常(如果响应时间更长,就等同于服务不可用),并且可能要求服务至少在 99.9% 的时间内正常运行。这些指标为服务的客户端设定了期望,并允许客户在未满足 SLA 时要求退款。

For example, percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service. An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (if the response time is longer, it might as well be down), and the service may be required to be up at least 99.9% of the time. These metrics set expectations for clients of the service and allow customers to demand a refund if the SLA is not met.
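正文中的那份 SLA 可以机械地对照一段时间窗内测得的响应时间来检查(示意代码;这里的“正常”只检查两个延迟阈值,忽略了 99.9% 可用性条款):

The SLA from the text can be checked mechanically against a window of measured response times (a sketch; "up" here checks only the two latency thresholds and ignores the 99.9% availability clause):

```python
import math

def meets_latency_slo(response_times_ms):
    """True if the median is under 200 ms and the 99th percentile is
    under 1 second, using the nearest-rank percentile definition."""
    ordered = sorted(response_times_ms)

    def pct(p):
        return ordered[math.ceil(p / 100 * len(ordered)) - 1]

    return pct(50) < 200 and pct(99) < 1000
```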

Queueing delays often account for a large part of the response time at high percentiles. As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests—an effect sometimes known as head-of-line blocking. Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete. Due to this effect, it is important to measure response times on the client side.
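A tiny queueing simulation makes head-of-line blocking visible. The sketch below is illustrative only (the request timings are invented): it models a server with a fixed pool of parallel workers and returns the response time each client observes, including time spent waiting in the queue:

```python
import heapq

def simulate(requests, workers):
    """requests: list of (arrival_time, service_time) in arrival order.
    Returns the response time each client sees (queueing + service)."""
    free_at = [0.0] * workers           # when each worker next becomes free
    heapq.heapify(free_at)
    observed = []
    for arrival, service in requests:
        start = max(arrival, heapq.heappop(free_at))  # may wait in the queue
        finish = start + service
        heapq.heappush(free_at, finish)
        observed.append(finish - arrival)
    return observed

# One slow request (1 s) followed by two cheap ones (10 ms each): with a
# single worker, the cheap requests also look slow from the client's side.
simulate([(0.0, 1.0), (0.01, 0.01), (0.02, 0.01)], workers=1)
```

Measured on the server, the second and third requests took only 10 ms each; measured at the client, all three took about a second — hence the advice to measure on the client side.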

When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements [23].
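The distinction can be sketched as an "open loop" generator that sends on a fixed schedule no matter how slowly responses come back. This is a simplified illustration (the rates and function names are made up, and a real load tester would also record response times):

```python
import threading
import time

def open_loop_load(send_request, rate_per_sec, duration_sec):
    """Fire requests at a constant rate, each in its own thread, so a slow
    response never delays the next send. A closed loop (wait for the reply,
    then send) would artificially shorten the server's queue and skew the
    measured response times."""
    interval = 1.0 / rate_per_sec
    next_send = time.monotonic()
    deadline = next_send + duration_sec
    sent = 0
    while next_send < deadline:
        threading.Thread(target=send_request, daemon=True).start()
        sent += 1
        next_send += interval
        time.sleep(max(0.0, next_send - time.monotonic()))
    return sent
```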

Figure 1-5. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.

Approaches for Coping with Load

Now that we have discussed the parameters for describing load and metrics for measuring performance, we can start discussing scalability in earnest: how do we maintain good performance even when our load parameters increase by some amount?

An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase—or perhaps even more often than that.

People often talk of a dichotomy between scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared-nothing architecture. A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can’t avoid scaling out. In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.

Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises (see “Rebalancing Partitions”).

While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. For this reason, common wisdom until recently was to keep your database on a single node (scale up) until scaling cost or high-availability requirements forced you to make it distributed.

As the tools and abstractions for distributed systems get better, this common wisdom may change, at least for some kinds of applications. It is conceivable that distributed data systems will become the default in the future, even for use cases that don’t handle large volumes of data or traffic. Over the course of the rest of this book we will cover many kinds of distributed data systems, and discuss how they fare not just in terms of scalability, but also ease of use and maintainability.

The architecture of systems that operate at large scale is usually highly specific to the application—there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.

For example, a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same data throughput.
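A quick back-of-the-envelope check (using decimal kB and GB) confirms that the data throughput really is the same in both cases:

```python
# 100,000 requests/second x 1 kB per request, in bytes per second:
small_fast = 100_000 * 1_000

# 3 requests/minute x 2 GB per request, in bytes per second:
big_slow = 3 * 2_000_000_000 / 60

assert small_fast == big_slow == 100_000_000  # both 100 MB/s
```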

An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare—the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive. In an early-stage startup or an unproven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.

Even though they are specific to a particular application, scalable architectures are nevertheless usually built from general-purpose building blocks, arranged in familiar patterns. In this book we discuss those building blocks and patterns.

Maintainability

It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.

Yet, unfortunately, many people working on software systems dislike maintenance of so-called legacy systems—perhaps it involves fixing other people’s mistakes, or working with platforms that are now outdated, or systems that were forced to do things they were never intended for. Every legacy system is unpleasant in its own way, and so it is difficult to give general recommendations for dealing with them.

However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves. To this end, we will pay particular attention to three design principles for software systems:

Operability

Make it easy for operations teams to keep the system running smoothly.

Simplicity

Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)

Evolvability

Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.

As previously with reliability and scalability, there are no easy solutions for achieving these goals. Rather, we will try to think about systems with operability, simplicity, and evolvability in mind.

Operability: Making Life Easy for Operations

It has been suggested that “good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations” [12]. While some aspects of operations can and should be automated, it is still up to humans to set up that automation in the first place and to make sure it’s working correctly.

Operations teams are vital to keeping a software system running smoothly. A good operations team typically is responsible for the following, and more [29]:

  • Monitoring the health of the system and quickly restoring service if it goes into a bad state

  • Tracking down the cause of problems, such as system failures or degraded performance

  • Keeping software and platforms up to date, including security patches

  • Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage

  • Anticipating future problems and solving them before they occur (e.g., capacity planning)

  • Establishing good practices and tools for deployment, configuration management, and more

  • Performing complex maintenance tasks, such as moving an application from one platform to another

  • Maintaining the security of the system as configuration changes are made

  • Defining processes that make operations predictable and help keep the production environment stable

  • Preserving the organization’s knowledge about the system, even as individual people come and go

Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including:

  • Providing visibility into the runtime behavior and internals of the system, with good monitoring

  • Providing good support for automation and integration with standard tools

  • Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)

  • Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)

  • Providing good default behavior, but also giving administrators the freedom to override defaults when needed

  • Self-healing where appropriate, but also giving administrators manual control over the system state when needed

  • Exhibiting predictable behavior, minimizing surprises

Simplicity: Managing Complexity

Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a big ball of mud [30].

There are various possible symptoms of complexity: explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more. Much has been said on this topic already [31, 32, 33].

When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.

Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [32] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.

One of the best tools we have for removing accidental complexity is abstraction. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade. A good abstraction can also be used for a wide range of different applications. Not only is this reuse more efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.

For example, high-level programming languages are abstractions that hide machine code, CPU registers, and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes. Of course, when programming in a high-level language, we are still using machine code; we are just not using it directly, because the programming language abstraction saves us from having to think about it.

However, finding good abstractions is very hard. In the field of distributed systems, although there are many good algorithms, it is much less clear how we should be packaging them into abstractions that help us keep the complexity of the system at a manageable level.

Throughout this book, we will keep our eyes open for good abstractions that allow us to extract parts of a large system into well-defined, reusable components.

Evolvability: Making Change Easy

It’s extremely unlikely that your system’s requirements will remain unchanged forever. They are much more likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, growth of the system forces architectural changes, etc.

In terms of organizational processes, Agile working patterns provide a framework for adapting to change. The Agile community has also developed technical tools and patterns that are helpful when developing software in a frequently changing environment, such as test-driven development (TDD) and refactoring.

Most discussions of these Agile techniques focus on a fairly small, local scale (a couple of source code files within the same application). In this book, we search for ways of increasing agility on the level of a larger data system, perhaps consisting of several different applications or services with different characteristics. For example, how would you “refactor” Twitter’s architecture for assembling home timelines (“Describing Load”) from approach 1 to approach 2?

The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability [34].

Summary

In this chapter, we have explored some fundamental ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, where we dive into deep technical detail.

An application has to meet various requirements in order to be useful. There are functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), and some nonfunctional requirements (general properties like security, reliability, compliance, scalability, compatibility, and maintainability). In this chapter we discussed reliability, scalability, and maintainability in detail.

Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.

Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter’s home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.

Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it.

There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals.

Later in the book, in Part III, we will look at patterns for systems that consist of several components working together, such as the one in Figure 1-1.

Footnotes

i Defined in “Approaches for Coping with Load”.

ii A term borrowed from electronic engineering, where it describes the number of logic gate inputs that are attached to another gate’s output. The output needs to supply enough current to drive all the attached inputs. In transaction processing systems, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request.

iii In an ideal world, the running time of a batch job is the size of the dataset divided by the throughput. In practice, the running time is often longer, due to skew (data not being spread evenly across worker processes) and needing to wait for the slowest task to complete.
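The footnote's point about skew can be made concrete with invented numbers, assuming each worker processes its assigned share at the same per-worker throughput:

```python
def ideal_runtime(dataset_bytes, total_throughput):
    """Ideal case: runtime is simply dataset size divided by throughput."""
    return dataset_bytes / total_throughput

def skewed_runtime(per_worker_bytes, worker_throughput):
    """In practice the job finishes only when the slowest worker does."""
    return max(size / worker_throughput for size in per_worker_bytes)

# 10 TB over 10 workers at 100 MB/s each: ideally 10,000 s...
ideal = ideal_runtime(10e12, 10 * 100e6)

# ...but if one worker receives 3 TB of the data while the other nine
# split the rest, the whole job takes as long as that one worker: 30,000 s.
skewed = skewed_runtime([3e12] + [7e12 / 9] * 9, 100e6)
```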

References

[1] Michael Stonebraker and Uğur Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” at 21st International Conference on Data Engineering (ICDE), April 2005.

[2] Walter L. Heimerdinger and Charles B. Weinstock: “A Conceptual Framework for System Fault Tolerance,” Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992.

[3] Ding Yuan, Yu Luo, Xin Zhuang, et al.: “Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.

[4] Yury Izrailevsky and Ariel Tseitlin: “The Netflix Simian Army,” techblog.netflix.com, July 19, 2011.

[5] Daniel Ford, François Labelle, Florentina I. Popovici, et al.: “Availability in Globally Distributed Storage Systems,” at 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2010.

[6] Brian Beach: “Hard Drive Reliability Update – Sep 2014,” backblaze.com, September 23, 2014.

[7] Laurie Voss: “AWS: The Good, the Bad and the Ugly,” blog.awe.sm, December 18, 2012.

[8] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “What Bugs Live in the Cloud?,” at 5th ACM Symposium on Cloud Computing (SoCC), November 2014. doi:10.1145/2670979.2670986

[9] Nelson Minar: “Leap Second Crashes Half the Internet,” somebits.com, July 3, 2012.

[10] Amazon Web Services: “Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region,” aws.amazon.com, April 29, 2011.

[11] Richard I. Cook: “How Complex Systems Fail,” Cognitive Technologies Laboratory, April 2000.

[12] Jay Kreps: “Getting Real About Distributed System Reliability,” blog.empathybox.com, March 19, 2012.

[13] David Oppenheimer, Archana Ganapathi, and David A. Patterson: “Why Do Internet Services Fail, and What Can Be Done About It?,” at 4th USENIX Symposium on Internet Technologies and Systems (USITS), March 2003.

[14] Nathan Marz: “Principles of Software Engineering, Part 1,” nathanmarz.com, April 2, 2013.

[15] Michael Jurewitz: “The Human Impact of Bugs,” jury.me, March 15, 2013.

[16] Raffi Krikorian: “Timelines at Scale,” at QCon San Francisco, November 2012.

[17] Martin Fowler: Patterns of Enterprise Application Architecture. Addison Wesley, 2002. ISBN: 978-0-321-12742-6

[18] Kelly Sommers: “After all that run around, what caused 500ms disk latency even when we replaced physical server?twitter.com, November 13, 2014.

[19] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “Dynamo: Amazon’s Highly Available Key-Value Store,” at 21st ACM Symposium on Operating Systems Principles (SOSP), October 2007.

[20] Greg Linden: “Make Data Useful,” slides from presentation at Stanford University Data Mining class (CS345), December 2006.

[21] Tammy Everts: “The Real Cost of Slow Time vs Downtime,” webperformancetoday.com, November 12, 2014.

[22] Jake Brutlag: “Speed Matters for Google Web Search,” googleresearch.blogspot.co.uk, June 22, 2009.

[23] Tyler Treat: “Everything You Know About Latency Is Wrong,” bravenewgeek.com, December 12, 2015.

[24] Jeffrey Dean and Luiz André Barroso: “The Tail at Scale,” Communications of the ACM, volume 56, number 2, pages 74–80, February 2013. doi:10.1145/2408776.2408794

[25] Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: “Forward Decay: A Practical Time Decay Model for Streaming Systems,” at 25th IEEE International Conference on Data Engineering (ICDE), March 2009.

[26] Ted Dunning and Otmar Ertl: “Computing Extremely Accurate Quantiles Using t-Digests,” github.com, March 2014.

[27] Gil Tene: “HdrHistogram,” hdrhistogram.org.

[28] Baron Schwartz: “Why Percentiles Don’t Work the Way You Think,” vividcortex.com, December 7, 2015.

[29] James Hamilton: “On Designing and Deploying Internet-Scale Services,” at 21st Large Installation System Administration Conference (LISA), November 2007.

[30] Brian Foote and Joseph Yoder: “Big Ball of Mud,” at 4th Conference on Pattern Languages of Programs (PLoP), September 1997.

[31] Frederick P Brooks: “No Silver Bullet – Essence and Accident in Software Engineering,” in The Mythical Man-Month, Anniversary edition, Addison-Wesley, 1995. ISBN: 978-0-201-83595-3

[32] Ben Moseley and Peter Marks: “Out of the Tar Pit,” at BCS Software Practice Advancement (SPA), 2006.

[33] Rich Hickey: “Simple Made Easy,” at Strange Loop, September 2011.

[34] Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: “Analyzing Software Evolvability,” at 32nd Annual IEEE International Computer Software and Applications Conference (COMPSAC), July 2008. doi:10.1109/COMPSAC.2008.50

Chapter 2. Data Models and Query Languages

The limits of my language mean the limits of my world.

Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1922)

Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we think about the problem that we are solving.

Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it represented in terms of the next-lower layer? For example:

  1. As an application developer, you look at the real world (in which there are people, organizations, goods, actions, money flows, sensors, etc.) and model it in terms of objects or data structures, and APIs that manipulate those data structures. Those structures are often specific to your application.

  2. When you want to store those data structures, you express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a relational database, or a graph model.

  3. The engineers who built your database software decided on a way of representing that JSON/XML/relational/graph data in terms of bytes in memory, on disk, or on a network. The representation may allow the data to be queried, searched, manipulated, and processed in various ways.

  4. On yet lower levels, hardware engineers have figured out how to represent bytes in terms of electrical currents, pulses of light, magnetic fields, and more.

In a complex application there may be more intermediary levels, such as APIs built upon APIs, but the basic idea is still the same: each layer hides the complexity of the layers below it by providing a clean data model. These abstractions allow different groups of people—for example, the engineers at the database vendor and the application developers using their database—to work together effectively.

有许多不同类型的数据模型,每种数据模型都体现了有关如何使用它的假设。有些用法很简单,有些则不支持;有些操作很快,有些则执行很差;有些数据转换感觉很自然,有些则很尴尬。

There are many different kinds of data models, and every data model embodies assumptions about how it is going to be used. Some kinds of usage are easy and some are not supported; some operations are fast and some perform badly; some data transformations feel natural and some are awkward.

掌握一种数据模型可能需要付出很大的努力(想想有多少关于关系数据建模的书籍)。构建软件已经足够困难了,即使只使用一种数据模型并且不担心其内部工作原理也是如此。但由于数据模型对其之上的软件能做什么和不能做什么有如此深远的影响,因此选择适合应用程序的模型非常重要。

It can take a lot of effort to master just one data model (think how many books there are on relational data modeling). Building software is hard enough, even when working with just one data model and without worrying about its inner workings. But since the data model has such a profound effect on what the software above it can and can’t do, it’s important to choose one that is appropriate to the application.

在本章中,我们将研究一系列用于数据存储和查询的通用数据模型(前面列表中的第 2 点)。特别是,我们将比较关系模型、文档模型和一些基于图的数据模型。我们还将研究各种查询语言并比较它们的用例。在第3章中我们将讨论存储引擎是如何工作的;也就是说,这些数据模型是如何实际实现的(列表中的第 3 点)。

In this chapter we will look at a range of general-purpose data models for data storage and querying (point 2 in the preceding list). In particular, we will compare the relational model, the document model, and a few graph-based data models. We will also look at various query languages and compare their use cases. In Chapter 3 we will discuss how storage engines work; that is, how these data models are actually implemented (point 3 in the list).

关系模型与文档模型

Relational Model Versus Document Model

如今最著名的数据模型可能是 SQL 模型,它基于 Edgar Codd 在 1970 年提出的关系模型 [ 1 ]:数据被组织成关系(在 SQL 中称为表),其中每个关系都是元组(在 SQL 中称为行)的无序集合。

The best-known data model today is probably that of SQL, based on the relational model proposed by Edgar Codd in 1970 [1]: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).

关系模型只是一个理论建议,当时很多人都怀疑它能否被高效地实现。然而,到了 20 世纪 80 年代中期,关系数据库管理系统 (RDBMS) 和 SQL 已成为大多数需要存储和查询具有某种规则结构的数据的人的首选工具。关系数据库的主导地位持续了大约 25-30 年——这在计算历史上可谓永恒。

The relational model was a theoretical proposal, and many people at the time doubted whether it could be implemented efficiently. However, by the mid-1980s, relational database management systems (RDBMSes) and SQL had become the tools of choice for most people who needed to store and query data with some kind of regular structure. The dominance of relational databases has lasted around 25‒30 years—an eternity in computing history.

关系数据库的根源在于业务数据处理,它是在 20 世纪 60 年代和 70 年代在大型计算机上执行的。从今天的角度来看,这些用例显得很平常:通常是事务处理(输入销售或银行交易、航班预订、仓库库存)和批处理(客户发票、工资单、报告)。

The roots of relational databases lie in business data processing, which was performed on mainframe computers in the 1960s and ’70s. The use cases appear mundane from today’s perspective: typically transaction processing (entering sales or banking transactions, airline reservations, stock-keeping in warehouses) and batch processing (customer invoicing, payroll, reporting).

当时的其他数据库迫使应用程序开发人员过多地考虑数据库中数据的内部表示。关系模型的目标是将这些实现细节隐藏在更简洁的接口后面。

Other databases at that time forced application developers to think a lot about the internal representation of the data in the database. The goal of the relational model was to hide that implementation detail behind a cleaner interface.

多年来,出现了许多相互竞争的数据存储和查询方法。在 20 世纪 70 年代和 80 年代初期,网络模型和层次模型是主要的替代方案,但最终是关系模型占据了主导地位。对象数据库在 20 世纪 80 年代末和 90 年代初来了又去。XML 数据库出现于 2000 年代初期,但只得到了小范围的采用。关系模型的每个竞争对手在当时都引起了大量的炒作,但热度从未持续多久 [ 2 ]。

Over the years, there have been many competing approaches to data storage and querying. In the 1970s and early 1980s, the network model and the hierarchical model were the main alternatives, but the relational model came to dominate them. Object databases came and went again in the late 1980s and early 1990s. XML databases appeared in the early 2000s, but have only seen niche adoption. Each competitor to the relational model generated a lot of hype in its time, but it never lasted [2].

随着计算机变得更加强大和联网,它们开始被用于越来越多样化的用途。值得注意的是,关系数据库的泛化能力非常好,超出了其最初的业务数据处理范围,扩展到了广泛的用例。您今天在网络上看到的大部分内容仍然由关系数据库提供支持,无论是在线发布、讨论、社交网络、电子商务、游戏、软件即服务生产力应用程序还是更多。

As computers became vastly more powerful and networked, they started being used for increasingly diverse purposes. And remarkably, relational databases turned out to generalize very well, beyond their original scope of business data processing, to a broad variety of use cases. Much of what you see on the web today is still powered by relational databases, be it online publishing, discussion, social networking, ecommerce, games, software-as-a-service productivity applications, or much more.

NoSQL 的诞生

The Birth of NoSQL

现在,进入 2010 年代,NoSQL是推翻关系模型统治地位的最新尝试。“NoSQL”这个名字很不幸,因为它实际上并不指任何特定的技术——它最初只是作为 2009 年开源、分布式、非关系数据库聚会的一个吸引人的 Twitter 标签[3 ]。尽管如此,这个词还是引起了人们的注意,并迅速在网络创业社区内外传播开来。许多有趣的数据库系统现在都与 #NoSQL 标签相关联,并且它已被追溯重新解释为Not Only SQL [ 4 ]。

Now, in the 2010s, NoSQL is the latest attempt to overthrow the relational model’s dominance. The name “NoSQL” is unfortunate, since it doesn’t actually refer to any particular technology—it was originally intended simply as a catchy Twitter hashtag for a meetup on open source, distributed, nonrelational databases in 2009 [3]. Nevertheless, the term struck a nerve and quickly spread through the web startup community and beyond. A number of interesting database systems are now associated with the #NoSQL hashtag, and it has been retroactively reinterpreted as Not Only SQL [4].

采用 NoSQL 数据库背后有多种驱动力,包括:

There are several driving forces behind the adoption of NoSQL databases, including:

  • 需要比关系数据库所能轻松实现的更强的可扩展性,包括支持非常大的数据集或非常高的写入吞吐量

  • A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput

  • 与商业数据库产品相比,人们普遍偏爱免费和开源软件

  • A widespread preference for free and open source software over commercial database products

  • 关系模型不能很好支持的专门查询操作

  • Specialized query operations that are not well supported by the relational model

  • 对关系模式的限制性感到沮丧,并渴望更动态和更具表现力的数据模型 [ 5 ]

  • Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model [5]

不同的应用程序有不同的要求,一个用例的最佳技术选择可能与另一个用例的最佳技术选择不同。因此,在可预见的未来,关系数据库似乎将继续与各种非关系数据存储一起使用——这种想法有时被称为多语言持久性 [ 3 ]。

Different applications have different requirements, and the best choice of technology for one use case may well be different from the best choice for another use case. It therefore seems likely that in the foreseeable future, relational databases will continue to be used alongside a broad variety of nonrelational datastores—an idea that is sometimes called polyglot persistence [3].

对象关系不匹配

The Object-Relational Mismatch

如今,大多数应用程序开发都是用面向对象的编程语言完成的,这导致了对 SQL 数据模型的普遍批评:如果数据存储在关系表中,则应用程序代码中的对象与表、行和列的数据库模型之间需要一个尴尬的转换层。模型之间的脱节有时称为 阻抗失配 。

Most application development today is done in object-oriented programming languages, which leads to a common criticism of the SQL data model: if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns. The disconnect between the models is sometimes called an impedance mismatch.

ActiveRecord 和 Hibernate 等对象关系映射 (ORM) 框架减少了该转换层所需的样板代码量,但它们无法完全隐藏两个模型之间的差异。

Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of boilerplate code required for this translation layer, but they can’t completely hide the differences between the two models.

例如,图 2-1 说明了如何在关系模式中表达简历(LinkedIn 个人资料)。整个个人资料可以通过唯一标识符 user_id 来识别。像 first_name 和 last_name 这样的字段每个用户只出现一次,因此可以将它们建模为 users 表中的列。然而,大多数人在其职业生涯中从事过不止一份工作(职位),并且人们可能有不同数量的教育时期和任意数量的联系信息。用户与这些项目之间存在一对多的关系,可以用多种方式表示:

For example, Figure 2-1 illustrates how a résumé (a LinkedIn profile) could be expressed in a relational schema. The profile as a whole can be identified by a unique identifier, user_id. Fields like first_name and last_name appear exactly once per user, so they can be modeled as columns on the users table. However, most people have had more than one job in their career (positions), and people may have varying numbers of periods of education and any number of pieces of contact information. There is a one-to-many relationship from the user to these items, which can be represented in various ways:

  • 在传统的 SQL 模型(SQL:1999 之前)中,最常见的规范化表示是将职位、教育和联系信息放在单独的表中,并使用外键引用 users 表,如图 2-1 所示。

  • In the traditional SQL model (prior to SQL:1999), the most common normalized representation is to put positions, education, and contact information in separate tables, with a foreign key reference to the users table, as in Figure 2-1.

  • SQL 标准的更高版本添加了对结构化数据类型和 XML 数据的支持;这允许将多值数据存储在单行中,并支持在这些文档内进行查询和索引。Oracle、IBM DB2、MS SQL Server 和 PostgreSQL [ 6 , 7 ]不同程度地支持这些功能。JSON 数据类型还受到多种数据库的支持,包括 IBM DB2、MySQL 和 PostgreSQL [ 8 ]。

  • Later versions of the SQL standard added support for structured datatypes and XML data; this allowed multi-valued data to be stored within a single row, with support for querying and indexing inside those documents. These features are supported to varying degrees by Oracle, IBM DB2, MS SQL Server, and PostgreSQL [6, 7]. A JSON datatype is also supported by several databases, including IBM DB2, MySQL, and PostgreSQL [8].

  • 第三种选择是将工作、教育和联系信息编码为 JSON 或 XML 文档,将其存储在数据库的文本列中,并让应用程序解释其结构和内容。在此设置中,您通常无法使用数据库来查询该编码列内的值。

  • A third option is to encode jobs, education, and contact info as a JSON or XML document, store it on a text column in the database, and let the application interpret its structure and content. In this setup, you typically cannot use the database to query for values inside that encoded column.

图 2-1。使用关系模式表示 LinkedIn 个人资料。比尔·盖茨的照片由维基共享资源、巴西通讯社 Ricardo Stuckert 提供。
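The shredded, normalized representation described above can be sketched as follows. This is a minimal illustration using SQLite; the table and column names are assumptions loosely following Figure 2-1, which defines the actual schema:

```python
import sqlite3

# A sketch of a normalized résumé schema in the spirit of Figure 2-1.
# Table and column names here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT
    );
    CREATE TABLE positions (
        position_id  INTEGER PRIMARY KEY,
        user_id      INTEGER REFERENCES users(user_id),
        job_title    TEXT,
        organization TEXT
    );
    CREATE TABLE education (
        education_id INTEGER PRIMARY KEY,
        user_id      INTEGER REFERENCES users(user_id),
        school_name  TEXT,
        start_year   INTEGER,
        end_year     INTEGER
    );
""")

# One row per user; one row per position, so a user can have many positions,
# each pointing back at the users table via a foreign key.
conn.execute("INSERT INTO users VALUES (251, 'Bill', 'Gates')")
conn.executemany(
    "INSERT INTO positions (user_id, job_title, organization) VALUES (?, ?, ?)",
    [(251, "Co-chair", "Bill & Melinda Gates Foundation"),
     (251, "Co-founder, Chairman", "Microsoft")])
```

The one-to-many relationship is represented purely by the repeated `user_id` values in the subordinate tables.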

对于像简历这样的数据结构(主要是一个独立的文档),JSON 表示可能非常合适:请参见示例 2-1。JSON 的优点是比 XML 简单得多。面向文档的数据库,如 MongoDB [ 9 ]、RethinkDB [ 10 ]、CouchDB [ 11 ] 和 Espresso [ 12 ] 支持这种数据模型。

For a data structure like a résumé, which is mostly a self-contained document, a JSON representation can be quite appropriate: see Example 2-1. JSON has the appeal of being much simpler than XML. Document-oriented databases like MongoDB [9], RethinkDB [10], CouchDB [11], and Espresso [12] support this data model.

示例 2-1。将 LinkedIn 个人资料表示为 JSON 文档
{
  "user_id":     251,
  "first_name":  "Bill",
  "last_name":   "Gates",
  "summary":     "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id":   "us:91",
  "industry_id": 131,
  "photo_url":   "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
    {"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University",       "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog":    "http://thegatesnotes.com",
    "twitter": "http://twitter.com/BillGates"
  }
}

一些开发人员认为 JSON 模型减少了应用程序代码和存储层之间的阻抗不匹配。然而,正如我们将在第 4 章中看到的,JSON 作为数据编码格式也存在问题。缺乏模式经常被认为是一个优点;我们将在“文档模型中的架构灵活性”中讨论这一点。

Some developers feel that the JSON model reduces the impedance mismatch between the application code and the storage layer. However, as we shall see in Chapter 4, there are also problems with JSON as a data encoding format. The lack of a schema is often cited as an advantage; we will discuss this in “Schema flexibility in the document model”.

JSON 表示比图 2-1 中的多表模式具有更好的局部性。如果要获取关系示例中的个人资料,则需要执行多个查询(通过 user_id 查询每个表),或者在 users 表及其下级表之间执行混乱的多路联接。在 JSON 表示中,所有相关信息都集中在一处,一次查询就足够了。

The JSON representation has better locality than the multi-table schema in Figure 2-1. If you want to fetch a profile in the relational example, you need to either perform multiple queries (query each table by user_id) or perform a messy multi-way join between the users table and its subordinate tables. In the JSON representation, all the relevant information is in one place, and one query is sufficient.
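The contrast can be sketched like this, with a toy two-table layout standing in for the relational side and a JSON string standing in for a stored document (names and values are illustrative assumptions):

```python
import sqlite3
import json

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users     (user_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE positions (user_id INTEGER, job_title TEXT);
""")
conn.execute("INSERT INTO users VALUES (251, 'Bill')")
conn.execute("INSERT INTO positions VALUES (251, 'Co-chair')")

# Relational: assemble the profile from several queries keyed by user_id...
name = conn.execute(
    "SELECT first_name FROM users WHERE user_id = ?", (251,)).fetchone()[0]
jobs = [row[0] for row in conn.execute(
    "SELECT job_title FROM positions WHERE user_id = ?", (251,))]

# ...or from one multi-way join between users and its subordinate tables.
joined = conn.execute("""
    SELECT u.first_name, p.job_title
    FROM users u JOIN positions p ON u.user_id = p.user_id
    WHERE u.user_id = ?""", (251,)).fetchall()

# Document: the entire profile is one value; a single fetch returns everything.
stored = json.dumps({"user_id": 251, "first_name": "Bill",
                     "positions": [{"job_title": "Co-chair"}]})
profile = json.loads(stored)
```

Both approaches yield the same information; the difference is how many round-trips and how much reassembly the application must do.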

从用户个人资料到用户的职位、教育经历和联系信息的一对多关系隐含着数据中的树形结构,而 JSON 表示使这种树形结构变得明确(见图 2-2)。

The one-to-many relationships from the user profile to the user’s positions, educational history, and contact information imply a tree structure in the data, and the JSON representation makes this tree structure explicit (see Figure 2-2).

图 2-2。一对多关系形成树结构。

多对一和多对多关系

Many-to-One and Many-to-Many Relationships

在上一节的示例 2-1中,region_idindustry_id是作为 ID 给出的,而不是作为纯文本字符串"Greater Seattle Area""Philanthropy"。为什么?

In Example 2-1 in the preceding section, region_id and industry_id are given as IDs, not as plain-text strings "Greater Seattle Area" and "Philanthropy". Why?

如果用户界面具有用于输入地区和行业的自由文本字段,则将它们存储为纯文本字符串是有意义的。但是,拥有标准化的地理区域和行业列表并让用户从下拉列表或自动完成器中进行选择是有好处的:

If the user interface has free-text fields for entering the region and the industry, it makes sense to store them as plain-text strings. But there are advantages to having standardized lists of geographic regions and industries, and letting users choose from a drop-down list or autocompleter:

  • 跨个人资料的风格和拼写一致

  • Consistent style and spelling across profiles

  • 避免歧义(例如,如果有多个城市同名)

  • Avoiding ambiguity (e.g., if there are several cities with the same name)

  • 易于更新——名称仅存储在一个位置,因此如果需要更改(例如,由于政治事件而更改城市名称),可以轻松进行全面更新

  • Ease of updating—the name is stored in only one place, so it is easy to update across the board if it ever needs to be changed (e.g., change of a city name due to political events)

  • 本地化支持——当网站翻译成其他语言时,可以对标准化列表进行本地化,以便以浏览者的语言显示地区和行业

  • Localization support—when the site is translated into other languages, the standardized lists can be localized, so the region and industry can be displayed in the viewer’s language

  • 更好的搜索——例如,搜索华盛顿州的慈善家可以匹配此个人资料,因为区域列表可以编码西雅图位于华盛顿州这一事实(这从字符串 "Greater Seattle Area" 中看不出来)

  • Better search—e.g., a search for philanthropists in the state of Washington can match this profile, because the list of regions can encode the fact that Seattle is in Washington (which is not apparent from the string "Greater Seattle Area")

无论存储 ID 还是文本字符串都存在重复问题。当您使用 ID 时,对人类有意义的信息(例如Philanthropy一词)仅存储在一个位置,而引用它的所有内容都使用 ID(仅在数据库中有意义)。当您直接存储文本时,您会在使用该文本的每个记录中复制对人类有意义的信息。

Whether you store an ID or a text string is a question of duplication. When you use an ID, the information that is meaningful to humans (such as the word Philanthropy) is stored in only one place, and everything that refers to it uses an ID (which only has meaning within the database). When you store the text directly, you are duplicating the human-meaningful information in every record that uses it.

使用 ID 的优点在于,因为它对人类没有意义,所以它永远不需要改变:即使它标识的信息发生变化,ID 也可以保持不变。任何对人类有意义的事情都可能需要在未来的某个时候进行更改,如果该信息重复,则所有冗余副本都需要更新。这会产生写入开销,并带来不一致的风险(其中某些信息副本已更新,但其他副本未更新)。删除此类重复是数据库规范化背后的关键思想。

The advantage of using an ID is that because it has no meaning to humans, it never needs to change: the ID can remain the same, even if the information it identifies changes. Anything that is meaningful to humans may need to change sometime in the future—and if that information is duplicated, all the redundant copies need to be updated. That incurs write overheads, and risks inconsistencies (where some copies of the information are updated but others aren’t). Removing such duplication is the key idea behind normalization in databases.

注意

数据库管理员和开发人员喜欢争论规范化和非规范化,但我们暂时暂不进行判断。在本书的第三部分中,我们将回到这个主题并探索处理缓存、非规范化和派生数据的系统方法。

Database administrators and developers love to argue about normalization and denormalization, but we will suspend judgment for now. In Part III of this book we will return to this topic and explore systematic ways of dealing with caching, denormalization, and derived data.

不幸的是,规范化这些数据需要多对一的关系(许多人生活在一个特定的地区,许多人在一个特定的行业工作),而这不能很好地适应文档模型。在关系数据库中,通过 ID 引用其他表中的行是很正常的,因为联接很容易。在文档数据库中,一对多的树结构不需要联接,而且对联接的支持通常很弱。

Unfortunately, normalizing this data requires many-to-one relationships (many people live in one particular region, many people work in one particular industry), which don’t fit nicely into the document model. In relational databases, it’s normal to refer to rows in other tables by ID, because joins are easy. In document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak.

如果数据库本身不支持联接,则必须在应用程序代码中通过对数据库进行多次查询来模拟联接。(在这种情况下,地区和行业的列表可能足够小且变化足够慢,应用程序可以简单地将它们保存在内存中。但尽管如此,执行联接的工作还是从数据库转移到了应用程序代码中。)

If the database itself does not support joins, you have to emulate a join in application code by making multiple queries to the database. (In this case, the lists of regions and industries are probably small and slow-changing enough that the application can simply keep them in memory. But nevertheless, the work of making the join is shifted from the database to the application code.)
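Emulating such a many-to-one join in application code might look like the following sketch. The lookup tables and IDs are hypothetical, mirroring the `region_id`/`industry_id` values of Example 2-1:

```python
# Small, slow-changing lookup tables that the application keeps in memory
# (in practice these might come from separate queries to the database).
regions = {"us:91": "Greater Seattle Area"}
industries = {131: "Philanthropy"}

def resolve_profile(profile):
    """Perform the 'join' in application code: replace IDs with display text."""
    resolved = dict(profile)
    resolved["region"] = regions[profile["region_id"]]
    resolved["industry"] = industries[profile["industry_id"]]
    return resolved

profile = {"user_id": 251, "region_id": "us:91", "industry_id": 131}
resolved = resolve_profile(profile)
```

The work a relational database would do inside a join (matching rows by key) is here done by dictionary lookups in the application.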

此外,即使应用程序的初始版本非常适合无连接文档模型,随着功能添加到应用程序中,数据也有变得更加互连的趋势。例如,考虑我们可以对简历示例进行一些更改:

Moreover, even if the initial version of an application fits well in a join-free document model, data has a tendency of becoming more interconnected as features are added to applications. For example, consider some changes we could make to the résumé example:

组织和学校作为实体
Organizations and schools as entities

在前面的描述中,organization(用户工作的公司)和 school_name(他们就读的学校)只是字符串。也许它们应该改为对实体的引用?然后每个组织、学校或大学都可以拥有自己的网页(带有徽标、新闻提要等);每份简历都可以链接到它提到的组织和学校,并包含它们的徽标和其他信息(参见图 2-3,了解来自 LinkedIn 的示例)。

In the previous description, organization (the company where the user worked) and school_name (where they studied) are just strings. Perhaps they should be references to entities instead? Then each organization, school, or university could have its own web page (with logo, news feed, etc.); each résumé could link to the organizations and schools that it mentions, and include their logos and other information (see Figure 2-3 for an example from LinkedIn).

建议
Recommendations

假设您要添加一项新功能:一个用户可以为另一个用户编写推荐。推荐会显示在被推荐用户的简历上,以及提出推荐的用户的姓名和照片。如果推荐者更新了他们的照片,他们写的任何推荐都需要反映新照片。因此,推荐应该参考作者的简介。

Say you want to add a new feature: one user can write a recommendation for another user. The recommendation is shown on the résumé of the user who was recommended, together with the name and photo of the user making the recommendation. If the recommender updates their photo, any recommendations they have written need to reflect the new photo. Therefore, the recommendation should have a reference to the author’s profile.

图 2-3。公司名称不仅仅是一个字符串,而是指向公司实体的链接。linkedin.com 的屏幕截图。

图 2-4说明了这些新功能如何需要多对多关系。每个虚线矩形内的数据可以分组为一个文档,但对组织、学校和其他用户的引用需要表示为引用,并且在查询时需要联接。

Figure 2-4 illustrates how these new features require many-to-many relationships. The data within each dotted rectangle can be grouped into one document, but the references to organizations, schools, and other users need to be represented as references, and require joins when queried.

图 2-4。通过多对多关系扩展简历。

文档数据库正在重复历史吗?

Are Document Databases Repeating History?

虽然多对多关系和联接在关系数据库中经常使用,但文档数据库和 NoSQL 重新引发了关于如何最好地在数据库中表示此类关系的争论。这场争论比 NoSQL 更古老——事实上,它可以追溯到最早的计算机化数据库系统。

While many-to-many relationships and joins are routinely used in relational databases, document databases and NoSQL reopened the debate on how best to represent such relationships in a database. This debate is much older than NoSQL—in fact, it goes back to the very earliest computerized database systems.

20 世纪 70 年代最流行的商业数据处理数据库是 IBM 的信息管理系统(IMS),最初是为阿波罗太空计划中的库存管理而开发的,并于 1968 年首次商业发布[ 13 ]。它至今仍在使用和维护,在 IBM 大型机上的 OS/390 上运行 [ 14 ]。

The most popular database for business data processing in the 1970s was IBM’s Information Management System (IMS), originally developed for stock-keeping in the Apollo space program and first commercially released in 1968 [13]. It is still in use and maintained today, running on OS/390 on IBM mainframes [14].

IMS 的设计使用了一种相当简单的数据模型,称为层次模型,它与文档数据库使用的 JSON 模型有一些显着的相似之处 [ 2 ]。它将所有数据表示为嵌套在记录中的记录树,非常类似于图 2-2的 JSON 结构。

The design of IMS used a fairly simple data model called the hierarchical model, which has some remarkable similarities to the JSON model used by document databases [2]. It represented all data as a tree of records nested within records, much like the JSON structure of Figure 2-2.

与文档数据库一样,IMS 非常适合一对多关系,但它使多对多关系变得困难,并且不支持联接。开发人员必须决定是复制(非规范化)数据还是手动解析从一条记录到另一条记录的引用。20 世纪 60 年代和 70 年代的这些问题非常类似于开发人员今天在文档数据库中遇到的问题 [ 15 ]。

Like document databases, IMS worked well for one-to-many relationships, but it made many-to-many relationships difficult, and it didn’t support joins. Developers had to decide whether to duplicate (denormalize) data or to manually resolve references from one record to another. These problems of the 1960s and ’70s were very much like the problems that developers are running into with document databases today [15].

为了解决分层模型的局限性,人们提出了各种解决方案。其中最突出的两个模型是关系模型(后来成为 SQL,并占领了世界)和网络模型(最初拥有大量追随者,但最终逐渐变得默默无闻)。这两个阵营之间的“大辩论”持续了 20 世纪 70 年代的大部分时间[ 2 ]。

Various solutions were proposed to solve the limitations of the hierarchical model. The two most prominent were the relational model (which became SQL, and took over the world) and the network model (which initially had a large following but eventually faded into obscurity). The “great debate” between these two camps lasted for much of the 1970s [2].

由于这两个模型所解决的问题在今天仍然具有重要意义,因此有必要根据今天的情况简要回顾一下这场争论。

Since the problem that the two models were solving is still so relevant today, it’s worth briefly revisiting this debate in today’s light.

网络模型

The network model

该网络模型由数据系统语言会议 (CODASYL) 委员会标准化,并由多个不同的数据库供应商实施;它也称为 CODASYL 模型[ 16 ]。

The network model was standardized by a committee called the Conference on Data Systems Languages (CODASYL) and implemented by several different database vendors; it is also known as the CODASYL model [16].

CODASYL 模型是层次模型的推广。在层次模型的树结构中,每条记录只有一个父记录;在网络模型中,一条记录可以有多个父记录。例如,该"Greater Seattle Area"区域可能有一条记录,并且居住在该区域的每个用户都可以链接到该记录。这允许对多对一和多对多关系进行建模。

The CODASYL model was a generalization of the hierarchical model. In the tree structure of the hierarchical model, every record has exactly one parent; in the network model, a record could have multiple parents. For example, there could be one record for the "Greater Seattle Area" region, and every user who lived in that region could be linked to it. This allowed many-to-one and many-to-many relationships to be modeled.

网络模型中记录之间的链接不是外键,而更像是编程语言中的指针(只是仍然存储在磁盘上)。访问一条记录的唯一方法是从根记录开始,沿着这些链接链构成的路径前进。这被称为访问路径。

The links between records in the network model were not foreign keys, but more like pointers in a programming language (while still being stored on disk). The only way of accessing a record was to follow a path from a root record along these chains of links. This was called an access path.

在最简单的情况下,访问路径可能类似于链表的遍历:从链表的头部开始,一次查看一条记录,直到找到所需的记录。但在多对多关系的世界中,几个不同的路径可能会导致相同的记录,并且使用网络模型的程序员必须在头脑中跟踪这些不同的访问路径。

In the simplest case, an access path could be like the traversal of a linked list: start at the head of the list, and look at one record at a time until you find the one you want. But in a world of many-to-many relationships, several different paths can lead to the same record, and a programmer working with the network model had to keep track of these different access paths in their head.
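A rough sketch of such an access path, modeling records as nodes that can only be reached by following pointers from a root record (an illustration of the idea, not CODASYL syntax; the record names and fields are assumptions):

```python
class Record:
    """A record reachable only via pointers, as in the network model."""
    def __init__(self, name, data=None):
        self.name = name
        self.data = data
        self.next = None  # pointer to the next record along this access path

# Build one access path: root -> user-1 -> user-2
root = Record("root")
user1 = Record("user-1", {"region": "Greater Seattle Area"})
user2 = Record("user-2", {"region": "Philadelphia"})
root.next = user1
user1.next = user2

def find(start, name):
    """Walk the chain from the start record until the wanted record appears."""
    node = start
    while node is not None:
        if node.name == name:
            return node
        node = node.next
    return None  # no path from this start record leads to the record
```

The application can reach `user-2` only by traversing the whole chain from the root; if the chain is restructured, every piece of code that walks it must change.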

CODASYL 中的查询是通过迭代记录列表并遵循访问路径在数据库中移动游标来执行的。如果一条记录有多个父记录(即,来自其他记录的多个传入指针),则应用程序代码必须跟踪所有不同的关系。甚至 CODASYL 委员会成员也承认这就像在 n维数据空间中导航 [ 17 ]。

A query in CODASYL was performed by moving a cursor through the database by iterating over lists of records and following access paths. If a record had multiple parents (i.e., multiple incoming pointers from other records), the application code had to keep track of all the various relationships. Even CODASYL committee members admitted that this was like navigating around an n-dimensional data space [17].

尽管手动选择访问路径能够最有效地利用 20 世纪 70 年代非常有限的硬件能力(例如磁带驱动器,其寻道速度极慢),但问题在于它们使查询和更新数据库的代码变得复杂且不灵活。无论是层次模型还是网络模型,如果您没有通往所需数据的路径,就会陷入困境。您可以更改访问路径,但随后必须检查大量手写的数据库查询代码,并重写它以处理新的访问路径。更改应用程序的数据模型是很困难的。

Although manual access path selection was able to make the most efficient use of the very limited hardware capabilities in the 1970s (such as tape drives, whose seeks are extremely slow), the problem was that they made the code for querying and updating the database complicated and inflexible. With both the hierarchical and the network model, if you didn’t have a path to the data you wanted, you were in a difficult situation. You could change the access paths, but then you had to go through a lot of handwritten database query code and rewrite it to handle the new access paths. It was difficult to make changes to an application’s data model.

关系模型

The relational model

相比之下,关系模型所做的是将所有数据公开地摆出来:关系(表)只是元组(行)的集合,仅此而已。如果您想查看数据,没有迷宫般的嵌套结构,也没有必须遵循的复杂访问路径。您可以读取表中的任何或所有行,选择与任意条件匹配的行。您可以通过将某些列指定为键并匹配这些列来读取特定行。您可以将新行插入到任何表中,而不必担心与其他表之间的外键关系。

What the relational model did, by contrast, was to lay out all the data in the open: a relation (table) is simply a collection of tuples (rows), and that’s it. There are no labyrinthine nested structures, no complicated access paths to follow if you want to look at the data. You can read any or all of the rows in a table, selecting those that match an arbitrary condition. You can read a particular row by designating some columns as a key and matching on those. You can insert a new row into any table without worrying about foreign key relationships to and from other tables.

在关系数据库中,查询优化器自动决定查询的哪些部分以何种顺序执行,以及使用哪些索引。这些选择实际上是“访问路径”,但最大的区别在于它们是由查询优化器而不是应用程序开发人员自动做出的,因此我们很少需要考虑它们。

In a relational database, the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use. Those choices are effectively the “access path,” but the big difference is that they are made automatically by the query optimizer, not by the application developer, so we rarely need to think about them.

如果您想以新的方式查询数据,只需声明一个新索引,查询将自动使用最合适的索引。您无需更改查询即可利用新索引。(另请参见“数据查询语言”。)因此,关系模型使得向应用程序添加新功能变得更加容易。

If you want to query your data in new ways, you can just declare a new index, and queries will automatically use whichever indexes are most appropriate. You don’t need to change your queries to take advantage of a new index. (See also “Query Languages for Data”.) The relational model thus made it much easier to add new features to applications.

关系数据库的查询优化器是复杂的野兽,它们耗费了多年的研究和开发工作[ 18 ]。但关系模型的一个关键见解是:您只需要构建一次查询优化器,然后所有使用数据库的应用程序都可以从中受益。如果您没有查询优化器,则手动编码特定查询的访问路径比编写通用优化器更容易,但从长远来看,通用解决方案会获胜。

Query optimizers for relational databases are complicated beasts, and they have consumed many years of research and development effort [18]. But a key insight of the relational model was this: you only need to build a query optimizer once, and then all applications that use the database can benefit from it. If you don’t have a query optimizer, it’s easier to handcode the access paths for a particular query than to write a general-purpose optimizer—but the general-purpose solution wins in the long run.

与文档数据库的比较

Comparison to document databases

文档数据库在一个方面回归了层次模型:将嵌套记录(一对多关系,如图 2-1 中的 positions、education 和 contact_info)存储在其父记录中,而不是存储在单独的表中。

Document databases reverted back to the hierarchical model in one aspect: storing nested records (one-to-many relationships, like positions, education, and contact_info in Figure 2-1) within their parent record rather than in a separate table.

然而,当涉及到表示多对一和多对多关系时,关系数据库和文档数据库并没有本质上的不同:在这两种情况下,相关项都由唯一标识符引用,该标识符在关系模型中称为外键,在文档模型中称为文档引用 [ 9 ]。该标识符在读取时通过联接或后续查询来解析。迄今为止,文档数据库还没有走上 CODASYL 的老路。

However, when it comes to representing many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: in both cases, the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference in the document model [9]. That identifier is resolved at read time by using a join or follow-up queries. To date, document databases have not followed the path of CODASYL.

当今的关系数据库与文档数据库

Relational Versus Document Databases Today

在将关系数据库与文档数据库进行比较时,需要考虑许多差异,包括它们的容错属性(请参阅第 5 章)和并发处理(请参阅 第 7 章)。在本章中,我们将仅关注数据模型的差异。

There are many differences to consider when comparing relational databases to document databases, including their fault-tolerance properties (see Chapter 5) and handling of concurrency (see Chapter 7). In this chapter, we will concentrate only on the differences in the data model.

支持文档数据模型的主要论点是模式灵活性、由于局部性而带来的更好的性能,以及对于某些应用程序来说它更接近应用程序使用的数据结构。关系模型通过为连接、多对一和多对多关系提供更好的支持来应对。

The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.

哪种数据模型可以使应用程序代码更简单?

Which data model leads to simpler application code?

如果应用程序中的数据具有类似文档的结构(即,一棵一对多关系的树,并且通常一次加载整棵树),那么使用文档模型可能是个好主意。关系技术中的粉碎(shredding)——将类似文档的结构拆分为多个表(如图 2-1 中的 positions、education 和 contact_info)——可能会导致繁琐的模式和不必要地复杂的应用程序代码。

If the data in your application has a document-like structure (i.e., a tree of one-to-many relationships, where typically the entire tree is loaded at once), then it’s probably a good idea to use a document model. The relational technique of shredding—splitting a document-like structure into multiple tables (like positions, education, and contact_info in Figure 2-1)—can lead to cumbersome schemas and unnecessarily complicated application code.

文档模型有局限性:例如,您不能直接引用文档中的嵌套项目,而是需要说类似“用户 251 的职位列表中的第二项”之类的内容(很像层次模型中的访问路径)。然而,只要文档嵌套得不太深,这通常不是问题。

The document model has limitations: for example, you cannot refer directly to a nested item within a document, but instead you need to say something like “the second item in the list of positions for user 251” (much like an access path in the hierarchical model). However, as long as documents are not too deeply nested, that is not usually a problem.
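For instance, with a document like Example 2-1 held as a plain Python structure, the only way to name a nested item is by its position within the parent:

```python
# A fragment of the Example 2-1 document as an in-memory structure.
profile = {
    "user_id": 251,
    "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
    ],
}

# No identifier refers to the nested item directly; it is addressed as
# "the second item in the list of positions for user 251".
second_position = profile["positions"][1]
```

If the list is reordered or an entry is removed, the positional "address" silently points at a different item, which is one reason deep nesting becomes awkward.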

对文档数据库中联接的较差支持可能会也可能不会成为问题,具体取决于应用程序。例如,在使用文档数据库记录哪些事件在何时发生的分析应用程序中可能永远不需要多对多关系[ 19 ]。

The poor support for joins in document databases may or may not be a problem, depending on the application. For example, many-to-many relationships may never be needed in an analytics application that uses a document database to record which events occurred at which time [19].

但是,如果您的应用程序确实使用多对多关系,则文档模型的吸引力就会降低。可以通过非规范化来减少联接需求,但应用程序代码需要执行额外的工作来保持非规范化数据的一致性。可以通过向数据库发出多个请求来在应用程序代码中模拟联接,但这也会将复杂性转移到应用程序中,并且通常比数据库内的专门代码执行的联接慢。在这种情况下,使用文档模型可能会导致应用程序代码变得更加复杂和性能更差[ 15 ]。

However, if your application does use many-to-many relationships, the document model becomes less appealing. It’s possible to reduce the need for joins by denormalizing, but then the application code needs to do additional work to keep the denormalized data consistent. Joins can be emulated in application code by making multiple requests to the database, but that also moves complexity into the application and is usually slower than a join performed by specialized code inside the database. In such cases, using a document model can lead to significantly more complex application code and worse performance [15].

一般来说,不可能说哪种数据模型会导致更简单的应用程序代码;它取决于数据项之间存在的关系类型。对于高度互连的数据,文档模型很尴尬,关系模型还可以接受,而图模型(参见 “类图数据模型”)是最自然的。

It’s not possible to say in general which data model leads to simpler application code; it depends on the kinds of relationships that exist between data items. For highly interconnected data, the document model is awkward, the relational model is acceptable, and graph models (see “Graph-Like Data Models”) are the most natural.

文档模型中的架构灵活性

Schema flexibility in the document model

大多数文档数据库以及关系数据库中的 JSON 支持不会对文档中的数据强制执行任何架构。关系数据库中的 XML 支持通常附带可选的架构验证。没有模式意味着可以将任意键和值添加到文档中,并且在读取时,客户端无法保证文档可能包含哪些字段。

Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents. XML support in relational databases usually comes with optional schema validation. No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guarantees as to what fields the documents may contain.

文档数据库有时被称为无模式,但这具有误导性,因为读取数据的代码通常假设某种结构,即存在隐式模式,只是数据库并不强制执行它 [20]。更准确的术语是“读取时模式”(数据结构是隐式的,仅在读取数据时进行解释),与“写入时模式”(关系数据库的传统方法,其中模式是显式的,并且数据库确保所有写入的数据都符合它)形成鲜明对比 [21]。

Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database [20]. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it) [21].

读取时模式类似于编程语言中的动态(运行时)类型检查,而写入时模式类似于静态(编译时)类型检查。正如静态和动态类型检查的倡导者对它们的相对优点进行了大辩论一样[ 22 ],数据库中模式的实施是一个有争议的话题,而且一般来说没有正确或错误的答案。

Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static and dynamic type checking have big debates about their relative merits [22], enforcement of schemas in the database is a contentious topic, and in general there’s no right or wrong answer.

在应用程序想要更改其数据格式的情况下,这些方法之间的差异尤其明显。例如,假设您当前将每个用户的全名存储在一个字段中,而您希望分别存储名字和姓氏 [ 23 ]。在文档数据库中,您只需开始使用新字段编写新文档,并在应用程序中添加代码来处理读取旧文档时的情况。例如:

The difference between the approaches is particularly noticeable in situations where an application wants to change the format of its data. For example, say you are currently storing each user’s full name in one field, and you instead want to store the first name and last name separately [23]. In a document database, you would just start writing new documents with the new fields and have code in the application that handles the case when old documents are read. For example:

if (user && user.name && !user.first_name) {
    // Documents written before Dec 8, 2013 don't have first_name
    user.first_name = user.name.split(" ")[0];
}

另一方面,在“静态类型”的数据库模式中,您通常会按照以下方式执行迁移:

On the other hand, in a “statically typed” database schema, you would typically perform a migration along the lines of:

ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1);      -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1);      -- MySQL

架构更改因缓慢且需要停机而名声不佳。这种名声并不完全是当之无愧的:大多数关系数据库系统在几毫秒内执行 ALTER TABLE 语句。MySQL 是一个值得注意的例外——它在执行 ALTER TABLE 时会复制整个表,这可能意味着更改大型表时需要几分钟甚至几小时的停机时间——尽管存在各种工具来解决此限制 [24, 25, 26]。

Schema changes have a bad reputation of being slow and requiring downtime. This reputation is not entirely deserved: most relational database systems execute the ALTER TABLE statement in a few milliseconds. MySQL is a notable exception—it copies the entire table on ALTER TABLE, which can mean minutes or even hours of downtime when altering a large table—although various tools exist to work around this limitation [24, 25, 26].

在任何数据库上,在大型表上运行 UPDATE 语句都可能很慢,因为每一行都需要重写。如果这是不可接受的,应用程序可以将 first_name 保留为默认值 NULL,并在读取时填充它,就像使用文档数据库一样。

Running the UPDATE statement on a large table is likely to be slow on any database, since every row needs to be rewritten. If that is not acceptable, the application can leave first_name set to its default of NULL and fill it in at read time, like it would with a document database.

如果集合中的项目由于某种原因(即数据是异构的)并不全部具有相同的结构,则读取模式方法是有利的,例如,因为:

The schema-on-read approach is advantageous if the items in the collection don’t all have the same structure for some reason (i.e., the data is heterogeneous)—for example, because:

  • 有许多不同类型的对象,将每种类型的对象放在自己的表中是不切实际的。

  • There are many different types of objects, and it is not practical to put each type of object in its own table.

  • 数据的结构由外部系统决定,您无法控制这些系统,并且可能随时发生变化。

  • The structure of the data is determined by external systems over which you have no control and which may change at any time.

在这样的情况下,模式可能弊大于利,而无模式文档可以是更自然的数据模型。但在所有记录都应具有相同结构的情况下,模式是记录和强制执行该结构的有用机制。我们将在第 4 章中更详细地讨论模式和模式演化。

In situations like these, a schema may hurt more than it helps, and schemaless documents can be a much more natural data model. But in cases where all records are expected to have the same structure, schemas are a useful mechanism for documenting and enforcing that structure. We will discuss schemas and schema evolution in more detail in Chapter 4.

查询的数据局部性

Data locality for queries

文档通常存储为单个连续字符串,编码为 JSON、XML 或其二进制变体(例如 MongoDB 的 BSON)。如果您的应用程序经常需要访问整个文档(例如,将其呈现在网页上),则此存储位置具有性能优势。如果数据分布在多个表中(如图2-1 所示),则需要多个索引查找来检索全部数据,这可能需要更多磁盘查找并花费更多时间。

A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB’s BSON). If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality. If data is split across multiple tables, like in Figure 2-1, multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time.

仅当您同时需要文档的大部分内容时,局部性优势才适用。数据库通常需要加载整个文档,即使您只访问其中的一小部分,这对于大型文档来说可能会造成浪费。在更新文档时,通常需要重写整个文档——只有不改变文档编码大小的修改才能轻松就地执行[ 19 ]。由于这些原因,通常建议您将文档保持得相当小,并避免增加文档大小的写入 [ 9 ]。这些性能限制大大减少了文档数据库有用的情况。

The locality advantage only applies if you need large parts of the document at the same time. The database typically needs to load the entire document, even if you access only a small portion of it, which can be wasteful on large documents. On updates to a document, the entire document usually needs to be rewritten—only modifications that don’t change the encoded size of a document can easily be performed in place [19]. For these reasons, it is generally recommended that you keep documents fairly small and avoid writes that increase the size of a document [9]. These performance limitations significantly reduce the set of situations in which document databases are useful.

值得指出的是,将相关数据分组在一起以实现局部性的想法并不限于文档模型。例如,Google 的 Spanner 数据库通过允许架构声明表的行应在父表中交错(嵌套),在关系数据模型中提供相同的局部性属性 [27]。Oracle 也允许这样做,使用一种称为多表索引簇表的功能 [28]。Bigtable 数据模型(用于 Cassandra 和 HBase)中的列族概念具有管理局部性的类似目的 [29]。

It’s worth pointing out that the idea of grouping related data together for locality is not limited to the document model. For example, Google’s Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) within a parent table [27]. Oracle allows the same, using a feature called multi-table index cluster tables [28]. The column-family concept in the Bigtable data model (used in Cassandra and HBase) has a similar purpose of managing locality [29].

我们还将在第 3 章中看到更多关于局部性的内容。

We will also see more on locality in Chapter 3.

文档数据库和关系数据库的融合

Convergence of document and relational databases

自 2000 年代中期以来,大多数关系数据库系统(MySQL 除外)都支持 XML。这包括对 XML 文档进行本地修改的功能以及在 XML 文档内部进行索引和查询的功能,这使得应用程序可以使用与使用文档数据库时非常相似的数据模型。

Most relational database systems (other than MySQL) have supported XML since the mid-2000s. This includes functions to make local modifications to XML documents and the ability to index and query inside XML documents, which allows applications to use data models very similar to what they would do when using a document database.

PostgreSQL 自版本 9.3 [ 8 ]、MySQL 自版本 5.7 和 IBM DB2 自版本 10.5 [ 30 ] 也对 JSON 文档具有类似级别的支持。鉴于 JSON 在 Web API 中的流行,其他关系数据库很可能会效仿他们的脚步并添加 JSON 支持。

PostgreSQL since version 9.3 [8], MySQL since version 5.7, and IBM DB2 since version 10.5 [30] also have a similar level of support for JSON documents. Given the popularity of JSON for web APIs, it is likely that other relational databases will follow in their footsteps and add JSON support.

在文档数据库方面,RethinkDB 在其查询语言中支持类似关系的联接,并且一些 MongoDB 驱动程序会自动解析数据库引用(有效地执行客户端联接,尽管这可能比在数据库中执行的联接慢,因为它需要额外的网络往返并且优化程度较低)。

On the document database side, RethinkDB supports relational-like joins in its query language, and some MongoDB drivers automatically resolve database references (effectively performing a client-side join, although this is likely to be slower than a join performed in the database since it requires additional network round-trips and is less optimized).

随着时间的推移,关系数据库和文档数据库似乎变得越来越相似,这是一件好事:数据模型相互补充。如果数据库能够处理类似文档的数据并对其执行关系查询,那么应用程序就可以使用最适合其需求的功能组合。

It seems that relational and document databases are becoming more similar over time, and that is a good thing: the data models complement each other. If a database is able to handle document-like data and also perform relational queries on it, applications can use the combination of features that best fits their needs.

关系模型和文档模型的混合是数据库未来的一个很好的发展方向。

A hybrid of the relational and document models is a good route for databases to take in the future.

数据查询语言

Query Languages for Data

当关系模型被引入时,它包含了一种新的数据查询方式:SQL 是一种 声明式查询语言,而 IMS 和 CODASYL 使用命令式代码查询数据库。这意味着什么?

When the relational model was introduced, it included a new way of querying data: SQL is a declarative query language, whereas IMS and CODASYL queried the database using imperative code. What does that mean?

许多常用的编程语言都是命令式的。例如,如果您有一个动物物种列表,您可以编写如下代码以仅返回列表中的鲨鱼:

Many commonly used programming languages are imperative. For example, if you have a list of animal species, you might write something like this to return only the sharks in the list:

function getSharks() {
    var sharks = [];
    for (var i = 0; i < animals.length; i++) {
        if (animals[i].family === "Sharks") {
            sharks.push(animals[i]);
        }
    }
    return sharks;
}
function getSharks() {
    var sharks = [];
    for (var i = 0; i < animals.length; i++) {
        if (animals[i].family === "Sharks") {
            sharks.push(animals[i]);
        }
    }
    return sharks;
}

在关系代数中,你可以写:

In the relational algebra, you would instead write:

sharks  =  σfamily = “Sharks” (animals)

sharks  =  σfamily = “Sharks” (animals)

其中 σ(希腊字母 sigma)是选择运算符,仅返回那些与条件family = “Sharks”匹配的动物。

where σ (the Greek letter sigma) is the selection operator, returning only those animals that match the condition family = “Sharks”.

当 SQL 被定义时,它相当紧密地遵循关系代数的结构:

When SQL was defined, it followed the structure of the relational algebra fairly closely:

SELECT * FROM animals WHERE family = 'Sharks';
SELECT * FROM animals WHERE family = 'Sharks';

命令式语言告诉计算机按特定顺序执行特定操作。您可以想象逐行单步执行代码,评估条件,更新变量,并决定是否再循环一次。

An imperative language tells the computer to perform certain operations in a certain order. You can imagine stepping through the code line by line, evaluating conditions, updating variables, and deciding whether to go around the loop one more time.

在声明式查询语言(例如 SQL 或关系代数)中,您只需指定所需数据的模式——结果必须满足哪些条件,以及您希望如何转换数据(例如,排序、分组和聚合)——但不指定如何实现这一目标。由数据库系统的查询优化器来决定使用哪些索引和哪些连接方法,以及以什么顺序执行查询的各个部分。

In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want—what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated)—but not how to achieve that goal. It is up to the database system’s query optimizer to decide which indexes and which join methods to use, and in which order to execute various parts of the query.

声明性查询语言很有吸引力,因为它通常比命令式 API 更简洁且更易于使用。但更重要的是,它还隐藏了数据库引擎的实现细节,这使得数据库系统无需对查询进行任何更改即可引入性能改进。

A declarative query language is attractive because it is typically more concise and easier to work with than an imperative API. But more importantly, it also hides implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.

例如,在本节开头所示的命令式代码中,动物列表以特定顺序出现。如果数据库想要在后台回收未使用的磁盘空间,它可能需要移动记录,更改动物出现的顺序。数据库可以安全地做到这一点,而不破坏查询吗?

For example, in the imperative code shown at the beginning of this section, the list of animals appears in a particular order. If the database wants to reclaim unused disk space behind the scenes, it might need to move records around, changing the order in which the animals appear. Can the database do that safely, without breaking queries?

SQL 示例不保证任何特定的顺序,因此它不介意顺序是否发生变化。但是,如果查询被编写为命令式代码,则数据库永远无法确定代码是否依赖于顺序。SQL 的功能更加有限,这一事实为数据库提供了更多自动优化的空间。

The SQL example doesn’t guarantee any particular ordering, and so it doesn’t mind if the order changes. But if the query is written as imperative code, the database can never be sure whether the code is relying on the ordering or not. The fact that SQL is more limited in functionality gives the database much more room for automatic optimizations.

最后,声明性语言通常适合并行执行。如今,CPU 的速度变得更快是通过添加更多内核,而不是通过以比以前更高的时钟速度运行 [ 31 ]。命令式代码很难跨多个核心和多台机器并行化,因为它指定必须以特定顺序执行的指令。声明性语言更有可能在并行执行中变得更快,因为它们仅指定结果的模式,而不指定用于确定结果的算法。如果合适的话,数据库可以自由地使用查询语言的并行实现[ 32 ]。

Finally, declarative languages often lend themselves to parallel execution. Today, CPUs are getting faster by adding more cores, not by running at significantly higher clock speeds than before [31]. Imperative code is very hard to parallelize across multiple cores and multiple machines, because it specifies instructions that must be performed in a particular order. Declarative languages have a better chance of getting faster in parallel execution because they specify only the pattern of the results, not the algorithm that is used to determine the results. The database is free to use a parallel implementation of the query language, if appropriate [32].

Web 上的声明式查询

Declarative Queries on the Web

声明性查询语言的优点不仅仅限于数据库。为了说明这一点,让我们在完全不同的环境(Web 浏览器)中比较声明式方法和命令式方法。

The advantages of declarative query languages are not limited to just databases. To illustrate the point, let’s compare declarative and imperative approaches in a completely different environment: a web browser.

假设您有一个有关海洋动物的网站。用户当前正在查看有关鲨鱼的页面,因此您将导航项“Sharks”标记为当前选定的,如下所示:

Say you have a website about animals in the ocean. The user is currently viewing the page on sharks, so you mark the navigation item “Sharks” as currently selected, like this:

<ul>
    <li class="selected"> 1
        <p>Sharks</p> 2
        <ul>
            <li>Great White Shark</li>
            <li>Tiger Shark</li>
            <li>Hammerhead Shark</li>
        </ul>
    </li>
    <li>
        <p>Whales</p>
        <ul>
            <li>Blue Whale</li>
            <li>Humpback Whale</li>
            <li>Fin Whale</li>
        </ul>
    </li>
</ul>
1

所选项目标有 CSS 类"selected"

The selected item is marked with the CSS class "selected".

2

<p>Sharks</p>是当前所选页面的标题。

<p>Sharks</p> is the title of the currently selected page.

现在假设您希望当前所选页面的标题具有蓝色背景,以便在视觉上突出显示。这很简单,使用 CSS:

Now say you want the title of the currently selected page to have a blue background, so that it is visually highlighted. This is easy, using CSS:

li.selected > p {
    background-color: blue;
}

这里 CSS 选择器 li.selected > p 声明了我们要应用蓝色样式的元素模式:即,所有直接父级是带有 CSS 类 selected 的 <li> 元素的 <p> 元素。示例中的 <p>Sharks</p> 元素与此模式匹配,但 <p>Whales</p> 不匹配,因为其 <li> 父元素缺少 class="selected"。

Here the CSS selector li.selected > p declares the pattern of elements to which we want to apply the blue style: namely, all <p> elements whose direct parent is an <li> element with a CSS class of selected. The element <p>Sharks</p> in the example matches this pattern, but <p>Whales</p> does not match because its <li> parent lacks class="selected".

如果您使用 XSL 而不是 CSS,您可以执行类似的操作:

If you were using XSL instead of CSS, you could do something similar:

<xsl:template match="li[@class='selected']/p">
    <fo:block background-color="blue">
        <xsl:apply-templates/>
    </fo:block>
</xsl:template>

这里,XPath 表达式 li[@class='selected']/p 相当于前面示例中的 CSS 选择器 li.selected > p。CSS 和 XSL 的共同点是它们都是用于指定文档样式的声明性语言。

Here, the XPath expression li[@class='selected']/p is equivalent to the CSS selector li.selected > p in the previous example. What CSS and XSL have in common is that they are both declarative languages for specifying the styling of a document.

想象一下,如果您必须使用命令式方法,生活会是什么样子。在 JavaScript 中,使用核心文档对象模型 (DOM) API,结果可能如下所示:

Imagine what life would be like if you had to use an imperative approach. In JavaScript, using the core Document Object Model (DOM) API, the result might look something like this:

var liElements = document.getElementsByTagName("li");
for (var i = 0; i < liElements.length; i++) {
    if (liElements[i].className === "selected") {
        var children = liElements[i].childNodes;
        for (var j = 0; j < children.length; j++) {
            var child = children[j];
            if (child.nodeType === Node.ELEMENT_NODE && child.tagName === "P") {
                child.setAttribute("style", "background-color: blue");
            }
        }
    }
}

这段 JavaScript 以命令式方式将 <p>Sharks</p> 元素设置为蓝色背景,但代码很糟糕。它不仅比 CSS 和 XSL 的等价写法更长、更难理解,而且还存在一些严重的问题:

This JavaScript imperatively sets the element <p>Sharks</p> to have a blue background, but the code is awful. Not only is it much longer and harder to understand than the CSS and XSL equivalents, but it also has some serious problems:

  • 如果 selected 类被删除(例如,因为用户单击了不同的页面),则即使重新运行代码,蓝色也不会被移除,因此该项目将保持突出显示状态,直到重新加载整个页面。使用 CSS,浏览器会自动检测 li.selected > p 规则何时不再适用,并在 selected 类被删除后立即移除蓝色背景。

  • If the selected class is removed (e.g., because the user clicks a different page), the blue color won’t be removed, even if the code is rerun—and so the item will remain highlighted until the entire page is reloaded. With CSS, the browser automatically detects when the li.selected > p rule no longer applies and removes the blue background as soon as the selected class is removed.

  • 如果您想利用新的 API,例如 document.getElementsByClassName("selected") 甚至 document.evaluate()(这可能会提高性能),您必须重写代码。另一方面,浏览器供应商可以在不破坏兼容性的情况下提高 CSS 和 XPath 的性能。

  • If you want to take advantage of a new API, such as document.getElementsByClassName("selected") or even document.evaluate()—which may improve performance—you have to rewrite the code. On the other hand, browser vendors can improve the performance of CSS and XPath without breaking compatibility.

在 Web 浏览器中,使用声明性 CSS 样式比在 JavaScript 中命令式操作样式要好得多。同样,在数据库中,SQL 等声明式查询语言比命令式查询 API 好得多。

In a web browser, using declarative CSS styling is much better than manipulating styles imperatively in JavaScript. Similarly, in databases, declarative query languages like SQL turned out to be much better than imperative query APIs.

MapReduce查询

MapReduce Querying

MapReduce是一种用于跨多台机器批量处理大量数据的编程模型,由 Google 推广[ 33 ]。一些 NoSQL 数据存储(包括 MongoDB 和 CouchDB)支持有限形式的 MapReduce,作为跨多个文档执行只读查询的机制。

MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google [33]. A limited form of MapReduce is supported by some NoSQL datastores, including MongoDB and CouchDB, as a mechanism for performing read-only queries across many documents.

一般而言,MapReduce 在第 10 章中有更详细的描述。现在,我们将简要讨论 MongoDB 对模型的使用。

MapReduce in general is described in more detail in Chapter 10. For now, we’ll just briefly discuss MongoDB’s use of the model.

MapReduce 既不是声明式查询语言,也不是完全命令式的查询 API,而是介于两者之间:查询的逻辑用代码片段表示,由处理框架重复调用。它基于许多函数式编程语言中存在的 map(也称为 collect)和 reduce(也称为 fold 或 inject)函数。

MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions that exist in many functional programming languages.
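The functional-language heritage of the two names can be seen with ordinary JavaScript arrays; this small sketch (unrelated to any particular database) maps each observation to a count and then reduces the counts to a total:

```javascript
// map transforms every element; reduce folds the mapped values into one.
const observations = [
  { family: "Sharks", numAnimals: 3 },
  { family: "Sharks", numAnimals: 4 }
];

const counts = observations.map(obs => obs.numAnimals); // [3, 4]
const total = counts.reduce((sum, n) => sum + n, 0);

console.log(total); // 7
```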

举个例子,假设您是一名海洋生物学家,每次看到海洋中的动物时,您都会在数据库中添加一条观察记录。现在您想要生成一份报告,说明您每月看到的鲨鱼数量。

To give an example, imagine you are a marine biologist, and you add an observation record to your database every time you see animals in the ocean. Now you want to generate a report saying how many sharks you have sighted per month.

在 PostgreSQL 中,您可以像这样表达该查询:

In PostgreSQL you might express that query like this:

SELECT date_trunc('month', observation_timestamp) AS observation_month, 1
       sum(num_animals) AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;
1

date_trunc('month', timestamp) 函数确定包含 timestamp 的日历月,并返回表示该月开始的另一个时间戳。换句话说,它将时间戳向下舍入到最近的月份。

The date_trunc('month', timestamp) function determines the calendar month containing timestamp, and returns another timestamp representing the beginning of that month. In other words, it rounds a timestamp down to the nearest month.

该查询首先过滤观测值,仅显示科中的物种Sharks,然后按观测值发生的日历月份对观测值进行分组,最后将该月所有观测值中看到的动物数量相加。

This query first filters the observations to only show species in the Sharks family, then groups the observations by the calendar month in which they occurred, and finally adds up the number of animals seen in all observations in that month.

同样的情况也可以用 MongoDB 的 MapReduce 功能来表达,如下所示:

The same can be expressed with MongoDB’s MapReduce feature as follows:

db.observations.mapReduce(
    function map() { 2
        var year  = this.observationTimestamp.getFullYear();
        var month = this.observationTimestamp.getMonth() + 1;
        emit(year + "-" + month, this.numAnimals); 3
    },
    function reduce(key, values) { 4
        return Array.sum(values); 5
    },
    {
        query: { family: "Sharks" }, 1
        out: "monthlySharkReport" 6
    }
);
1

可以以声明方式指定仅考虑鲨鱼种类的过滤器(这是 MapReduce 的 MongoDB 特定扩展)。

The filter to consider only shark species can be specified declaratively (this is a MongoDB-specific extension to MapReduce).

2

对于每个与 query 匹配的文档,JavaScript 函数 map 都会被调用一次,并且 this 被设置为该文档对象。

The JavaScript function map is called once for every document that matches query, with this set to the document object.

3

map函数发出一个键(由年和月组成的字符串,例如"2013-12""2014-1")和一个值(该观察中的动物数量)。

The map function emits a key (a string consisting of year and month, such as "2013-12" or "2014-1") and a value (the number of animals in that observation).

4

map 发出的键值对按键分组。对于具有相同键(即相同的月份和年份)的所有键值对,reduce 函数会被调用一次。

The key-value pairs emitted by map are grouped by key. For all key-value pairs with the same key (i.e., the same month and year), the reduce function is called once.

5

reduce函数将特定月份所有观察到的动物数量相加。

The reduce function adds up the number of animals from all observations in a particular month.

6

最终输出被写入 monthlySharkReport 集合中。

The final output is written to the collection monthlySharkReport.

例如,假设observations集合包含这两个文档:

For example, say the observations collection contains these two documents:

{
    observationTimestamp: Date.parse("Mon, 25 Dec 1995 12:34:56 GMT"),
    family:     "Sharks",
    species:    "Carcharodon carcharias",
    numAnimals: 3
}
{
    observationTimestamp: Date.parse("Tue, 12 Dec 1995 16:17:18 GMT"),
    family:     "Sharks",
    species:    "Carcharias taurus",
    numAnimals: 4
}

map函数将为每个文档调用一次,结果是 emit("1995-12", 3)emit("1995-12", 4)。随后,该reduce函数将被调用并reduce("1995-12", [3, 4])返回 7

The map function would be called once for each document, resulting in emit("1995-12", 3) and emit("1995-12", 4). Subsequently, the reduce function would be called with reduce("1995-12", [3, 4]), returning 7.
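The calling sequence just described can be mimicked in a few lines of plain JavaScript. This toy runner is not MongoDB's implementation — in particular, emit is passed in as a parameter rather than being provided as a global, to keep the sketch self-contained:

```javascript
// A toy MapReduce runner mimicking the semantics described above:
// run map once per document (with `this` set to the document), group
// the emitted key-value pairs by key, then run reduce once per key.
function mapReduce(docs, map, reduce) {
  const groups = {};
  const emit = (key, value) => (groups[key] = groups[key] || []).push(value);
  for (const doc of docs) map.call(doc, emit);
  const out = {};
  for (const key of Object.keys(groups)) out[key] = reduce(key, groups[key]);
  return out;
}

const observations = [
  { observationTimestamp: new Date("1995-12-25T12:34:56Z"), numAnimals: 3 },
  { observationTimestamp: new Date("1995-12-12T16:17:18Z"), numAnimals: 4 }
];

const report = mapReduce(
  observations,
  function (emit) {
    const year = this.observationTimestamp.getFullYear();
    const month = this.observationTimestamp.getMonth() + 1;
    emit(year + "-" + month, this.numAnimals);
  },
  (key, values) => values.reduce((a, b) => a + b, 0)
);

console.log(report); // { '1995-12': 7 }
```

Both documents emit under the key "1995-12", so reduce is called once with the values [3, 4] and produces 7, matching the walkthrough above.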

map 和 reduce 函数在允许执行的操作方面受到一定限制。它们必须是纯函数,这意味着它们只使用传递给它们的数据作为输入,不能执行额外的数据库查询,并且不能有任何副作用。这些限制允许数据库在任何地方、以任何顺序运行这些函数,并在失败时重新运行它们。然而,它们仍然很强大:它们可以解析字符串、调用库函数、执行计算等等。

The map and reduce functions are somewhat restricted in what they are allowed to do. They must be pure functions, which means they only use the data that is passed to them as input, they cannot perform additional database queries, and they must not have any side effects. These restrictions allow the database to run the functions anywhere, in any order, and rerun them on failure. However, they are nevertheless powerful: they can parse strings, call library functions, perform calculations, and more.

MapReduce 是一个相当低级的编程模型,用于在机器集群上分布式执行。像 SQL 这样的高级查询语言可以实现为 MapReduce 操作的管道(参见第 10 章),但是也有许多不使用 MapReduce 的 SQL 分布式实现。请注意,SQL 中没有任何内容限制它在单台机器上运行,并且 MapReduce 并不垄断分布式查询执行。

MapReduce is a fairly low-level programming model for distributed execution on a cluster of machines. Higher-level query languages like SQL can be implemented as a pipeline of MapReduce operations (see Chapter 10), but there are also many distributed implementations of SQL that don’t use MapReduce. Note there is nothing in SQL that constrains it to running on a single machine, and MapReduce doesn’t have a monopoly on distributed query execution.

能够在查询中间使用 JavaScript 代码对于高级查询来说是一个很棒的功能,但它不仅限于 MapReduce — 一些 SQL 数据库也可以使用 JavaScript 函数进行扩展 [34 ]

Being able to use JavaScript code in the middle of a query is a great feature for advanced queries, but it’s not limited to MapReduce—some SQL databases can be extended with JavaScript functions too [34].

MapReduce 的一个可用性问题是您必须编写两个仔细协调的 JavaScript 函数,这通常比编写单个查询更困难。此外,声明性查询语言为查询优化器提供了更多提高查询性能的机会。由于这些原因,MongoDB 2.2 添加了对称为聚合管道的声明式查询语言的支持 [ 9 ]。在这种语言中,相同的鲨鱼计数查询如下所示:

A usability problem with MapReduce is that you have to write two carefully coordinated JavaScript functions, which is often harder than writing a single query. Moreover, a declarative query language offers more opportunities for a query optimizer to improve the performance of a query. For these reasons, MongoDB 2.2 added support for a declarative query language called the aggregation pipeline [9]. In this language, the same shark-counting query looks like this:

db.observations.aggregate([
    { $match: { family: "Sharks" } },
    { $group: {
        _id: {
            year:  { $year:  "$observationTimestamp" },
            month: { $month: "$observationTimestamp" }
        },
        totalAnimals: { $sum: "$numAnimals" }
    } }
]);

聚合管道语言在表达能力上类似于 SQL 的子集,但它使用基于 JSON 的语法,而不是 SQL 的英语句子式语法;差异也许是品味问题。这个故事的寓意是,NoSQL 系统可能会发现自己意外地重新发明了 SQL,尽管是伪装的。

The aggregation pipeline language is similar in expressiveness to a subset of SQL, but it uses a JSON-based syntax rather than SQL’s English-sentence-style syntax; the difference is perhaps a matter of taste. The moral of the story is that a NoSQL system may find itself accidentally reinventing SQL, albeit in disguise.

类图数据模型

Graph-Like Data Models

我们之前看到,多对多关系是不同数据模型之间的一个重要区分特征。如果您的应用程序主要具有一对多关系(树结构数据)或记录之间没有关系,则文档模型是合适的。

We saw earlier that many-to-many relationships are an important distinguishing feature between different data models. If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.

但是,如果多对多关系在您的数据中非常常见怎么办?关系模型可以处理多对多关系的简单情况,但随着数据内的连接变得更加复杂,开始将数据建模为图形就变得更加自然。

But what if many-to-many relationships are very common in your data? The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.

图由两种对象组成:顶点(也称为节点或实体)和边(也称为关系或弧)。许多类型的数据都可以建模为图。典型例子包括:

A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs). Many kinds of data can be modeled as a graph. Typical examples include:

社交图谱
Social graphs

顶点是人,边表示哪些人互相认识。

Vertices are people, and edges indicate which people know each other.

网络图
The web graph

顶点是网页,边表示指向其他页面的 HTML 链接。

Vertices are web pages, and edges indicate HTML links to other pages.

公路或铁路网络
Road or rail networks

顶点是交汇点,边代表它们之间的道路或铁路线。

Vertices are junctions, and edges represent the roads or railway lines between them.

众所周知的算法可以在这些图上运行:例如,汽车导航系统搜索道路网络中两点之间的最短路径,而 PageRank 可以用在网络图上,以确定网页的受欢迎程度,从而确定其在搜索结果中的排名。

Well-known algorithms can operate on these graphs: for example, car navigation systems search for the shortest path between two points in a road network, and PageRank can be used on the web graph to determine the popularity of a web page and thus its ranking in search results.
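As a tiny illustration of such an algorithm, a breadth-first search finds a path with the fewest edges between two junctions in an unweighted network; the adjacency list below is a made-up road network, not data from the book:

```javascript
// Breadth-first search: shortest path (fewest hops) in an unweighted graph.
// Junctions and roads here are made up for illustration.
const roads = {
  A: ["B", "C"],
  B: ["A", "D"],
  C: ["A", "D"],
  D: ["B", "C", "E"],
  E: ["D"]
};

function shortestPath(graph, start, goal) {
  const queue = [[start]];          // queue of partial paths
  const visited = new Set([start]);
  while (queue.length > 0) {
    const path = queue.shift();
    const node = path[path.length - 1];
    if (node === goal) return path; // first time we reach goal is shortest
    for (const next of graph[node] || []) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push([...path, next]);
      }
    }
  }
  return null; // no route exists
}

console.log(shortestPath(roads, "A", "E")); // [ 'A', 'B', 'D', 'E' ]
```

Real navigation systems use weighted variants (e.g., Dijkstra's algorithm over road lengths), but the structure — follow edges outward from a vertex — is the same.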

在刚刚给出的示例中,图中的所有顶点都代表同一类事物(分别是人、网页或路口)。然而,图并不局限于这种同构数据:图的一个同样强大的用途是提供在单个数据存储中存储完全不同类型的对象的一致方法。例如,Facebook 维护一个包含许多不同类型的顶点和边的单一图:顶点代表人物、位置、事件、签到和用户发表的评论;边缘表示哪些人彼此是朋友、哪个签到发生在哪个位置、谁评论了哪个帖子、谁参加了哪个活动等等[35 ]

In the examples just given, all the vertices in a graph represent the same kind of thing (people, web pages, or road junctions, respectively). However, graphs are not limited to such homogeneous data: an equally powerful use of graphs is to provide a consistent way of storing completely different types of objects in a single datastore. For example, Facebook maintains a single graph with many different types of vertices and edges: vertices represent people, locations, events, checkins, and comments made by users; edges indicate which people are friends with each other, which checkin happened in which location, who commented on which post, who attended which event, and so on [35].

在本节中,我们将使用图 2-5 中所示的示例。它可以取自社交网络或家谱数据库:它显示了两个人,来自爱达荷州的露西和来自法国博纳的阿兰。他们已婚并住在伦敦。

In this section we will use the example shown in Figure 2-5. It could be taken from a social network or a genealogical database: it shows two people, Lucy from Idaho and Alain from Beaune, France. They are married and living in London.

图 2-5。图结构数据示例(方框代表顶点,箭头代表边)。

有几种不同但相关的方式来构造和查询图中的数据。在本节中,我们将讨论属性图模型(由 Neo4j、Titan 和 InfiniteGraph 实现)和三重存储模型(由 Datomic、AllegroGraph 等实现)。我们将研究三种图形声明式查询语言:Cypher、SPARQL 和 Datalog。除此之外,还有命令式图查询语言,例如 Gremlin [ 36 ] 和图处理框架,例如 Pregel(参见第 10 章)。

There are several different, but related, ways of structuring and querying data in graphs. In this section we will discuss the property graph model (implemented by Neo4j, Titan, and InfiniteGraph) and the triple-store model (implemented by Datomic, AllegroGraph, and others). We will look at three declarative query languages for graphs: Cypher, SPARQL, and Datalog. Besides these, there are also imperative graph query languages such as Gremlin [36] and graph processing frameworks like Pregel (see Chapter 10).

属性图

Property Graphs

在属性图模型中,每个顶点由以下部分组成:

In the property graph model, each vertex consists of:

  • 唯一标识符

  • A unique identifier

  • 一组出边

  • A set of outgoing edges

  • 一组传入边

  • A set of incoming edges

  • 属性集合(键值对)

  • A collection of properties (key-value pairs)

每条边由以下部分组成:

Each edge consists of:

  • 唯一标识符

  • A unique identifier

  • 边开始的顶点(尾部顶点

  • The vertex at which the edge starts (the tail vertex)

  • 边结束的顶点(头顶点

  • The vertex at which the edge ends (the head vertex)

  • 描述两个顶点之间关系类型的标签

  • A label to describe the kind of relationship between the two vertices

  • 属性集合(键值对)

  • A collection of properties (key-value pairs)

您可以将图存储视为由两个关系表组成,一张用于顶点,一张用于边,如示例 2-2 所示(此模式使用 PostgreSQL 的 json 数据类型来存储每个顶点或边的属性)。每条边都存储了头、尾顶点;如果您想要某个顶点的入边或出边集合,可以分别按 head_vertex 或 tail_vertex 查询 edges 表。

You can think of a graph store as consisting of two relational tables, one for vertices and one for edges, as shown in Example 2-2 (this schema uses the PostgreSQL json datatype to store the properties of each vertex or edge). The head and tail vertex are stored for each edge; if you want the set of incoming or outgoing edges for a vertex, you can query the edges table by head_vertex or tail_vertex, respectively.

示例 2-2。使用关系模式表示属性图
CREATE TABLE vertices (
    vertex_id   integer PRIMARY KEY,
    properties  json
);

CREATE TABLE edges (
    edge_id     integer PRIMARY KEY,
    tail_vertex integer REFERENCES vertices (vertex_id),
    head_vertex integer REFERENCES vertices (vertex_id),
    label       text,
    properties  json
);

CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
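To make the relational representation above concrete, here is a hedged sketch in Python using the standard-library sqlite3 module (SQLite stores the JSON properties as plain text rather than PostgreSQL's json type; the sample vertices and edges are invented for illustration):

```python
import sqlite3
import json

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vertices (
    vertex_id  INTEGER PRIMARY KEY,
    properties TEXT          -- JSON stored as text in SQLite
);
CREATE TABLE edges (
    edge_id     INTEGER PRIMARY KEY,
    tail_vertex INTEGER REFERENCES vertices (vertex_id),
    head_vertex INTEGER REFERENCES vertices (vertex_id),
    label       TEXT,
    properties  TEXT
);
CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
""")

# Invented sample data: Lucy was born in Idaho, which is within the USA.
conn.executemany("INSERT INTO vertices VALUES (?, ?)", [
    (1, json.dumps({"name": "Lucy", "type": "person"})),
    (2, json.dumps({"name": "Idaho", "type": "state"})),
    (3, json.dumps({"name": "United States", "type": "country"})),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 2, "born_in", "{}"),   # Lucy -born_in-> Idaho
    (2, 2, 3, "within",  "{}"),   # Idaho -within-> United States
])

# Outgoing edges of vertex 2 (Idaho): query the edges table by tail_vertex.
outgoing = conn.execute(
    "SELECT label, head_vertex FROM edges WHERE tail_vertex = ?", (2,)
).fetchall()
print(outgoing)  # [('within', 3)]
```

The symmetric query on head_vertex would return the incoming edges; the two indexes make both directions of traversal efficient.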

该模型的一些重要方面是:

Some important aspects of this model are:

  1. 任何顶点都可以有一条边将其与任何其他顶点连接起来。没有模式限制哪些类型的事物可以关联或不可以关联。

  2. Any vertex can have an edge connecting it with any other vertex. There is no schema that restricts which kinds of things can or cannot be associated.

  3. 给定任何顶点,您可以有效地找到其传入边和传出边,从而向前和向后遍历图,即沿着一条穿过顶点链的路径行进。(这就是示例 2-2 在 tail_vertex 和 head_vertex 两列上都建立索引的原因。)

  4. Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus traverse the graph—i.e., follow a path through a chain of vertices—both forward and backward. (That’s why Example 2-2 has indexes on both the tail_vertex and head_vertex columns.)

  5. 通过对不同类型的关系使用不同的标签,您可以在单个图表中存储多种不同类型的信息,同时仍然保持干净的数据模型。

  6. By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while still maintaining a clean data model.

这些功能为图的数据建模提供了很大的灵活性,如图 2-5 所示。该图显示了一些在传统关系模式中难以表达的内容,例如不同国家不同类型的区域结构(法国有省和大区,而美国有县和州)、历史怪癖(例如国中之国,这里暂时忽略主权国家和民族的复杂性),以及不同粒度的数据(露西当前的居住地被指定到城市一级,而她的出生地仅指定到州一级)。

Those features give graphs a great deal of flexibility for data modeling, as illustrated in Figure 2-5. The figure shows a few things that would be difficult to express in a traditional relational schema, such as different kinds of regional structures in different countries (France has départements and régions, whereas the US has counties and states), quirks of history such as a country within a country (ignoring for now the intricacies of sovereign states and nations), and varying granularity of data (Lucy’s current residence is specified as a city, whereas her place of birth is specified only at the level of a state).

您可以想象将图表扩展为还包括有关露西和阿兰或其他人的许多其他事实。例如,您可以使用它来指示他们对任何食物过敏(通过为每种过敏原引入一个顶点,并在人和过敏原之间引入一条边来指示过敏),并将过敏原与一组顶点联系起来,这些顶点显示哪些过敏原食物中含有哪些物质。然后您可以编写一个查询来找出每个人吃什么是安全的。 图有利于可进化性:当您向应用程序添加功能时,可以轻松扩展图以适应应用程序数据结构的变化。

You could imagine extending the graph to also include many other facts about Lucy and Alain, or other people. For instance, you could use it to indicate any food allergies they have (by introducing a vertex for each allergen, and an edge between a person and an allergen to indicate an allergy), and link the allergens with a set of vertices that show which foods contain which substances. Then you could write a query to find out what is safe for each person to eat. Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application’s data structures.
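The evolvability point can be sketched with a toy in-memory property graph; the allergen vertices, edge labels, and helper names below are all invented for illustration:

```python
# A minimal in-memory property graph: vertices keyed by id, edges as
# (tail, label, head) triples. All data here is invented for illustration.
vertices = {
    "lucy":  {"type": "person", "name": "Lucy"},
    "alain": {"type": "person", "name": "Alain"},
}
edges = set()

def add_vertex(vid, props):
    vertices[vid] = props

def add_edge(tail, label, head):
    edges.add((tail, label, head))

# Adding a new feature (allergies) needs no schema migration:
# just new vertex types and new edge labels.
add_vertex("peanuts", {"type": "allergen", "name": "peanuts"})
add_vertex("satay",   {"type": "food", "name": "chicken satay"})
add_edge("lucy", "allergic_to", "peanuts")
add_edge("satay", "contains", "peanuts")

def unsafe_foods(person):
    """Foods containing any allergen the person is allergic to."""
    allergens = {h for (t, l, h) in edges
                 if t == person and l == "allergic_to"}
    return {t for (t, l, h) in edges
            if l == "contains" and h in allergens}

print(unsafe_foods("lucy"))  # {'satay'}
```

Nothing about the existing person vertices had to change to support the new query, which is the evolvability property the text describes.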

Cypher 查询语言

The Cypher Query Language

Cypher是一种属性图的声明性查询语言,为 Neo4j 图形数据库创建[ 37 ]。(它以电影《黑客帝国》中的一个角色命名 ,与密码学中的密码无关[ 38 ]。)

Cypher is a declarative query language for property graphs, created for the Neo4j graph database [37]. (It is named after a character in the movie The Matrix and is not related to ciphers in cryptography [38].)

示例 2-3 显示了将图 2-5 左侧部分插入到图数据库中的 Cypher 查询。图的其余部分可以类似地添加,为了可读性此处省略。每个顶点都被赋予一个符号名称,例如 USA 或 Idaho,查询的其他部分可以使用这些名称在顶点之间创建边,使用箭头表示法:(Idaho) -[:WITHIN]-> (USA) 创建一条标记为 WITHIN 的边,以 Idaho 为尾节点,USA 为头节点。

Example 2-3 shows the Cypher query to insert the lefthand portion of Figure 2-5 into a graph database. The rest of the graph can be added similarly and is omitted for readability. Each vertex is given a symbolic name like USA or Idaho, and other parts of the query can use those names to create edges between the vertices, using an arrow notation: (Idaho) -[:WITHIN]-> (USA) creates an edge labeled WITHIN, with Idaho as the tail node and USA as the head node.

示例 2-3。图 2-5中的数据子集,表示为 Cypher 查询
CREATE
  (NAmerica:Location {name:'North America', type:'continent'}),
  (USA:Location      {name:'United States', type:'country'  }),
  (Idaho:Location    {name:'Idaho',         type:'state'    }),
  (Lucy:Person       {name:'Lucy' }),
  (Idaho) -[:WITHIN]->  (USA)  -[:WITHIN]-> (NAmerica),
  (Lucy)  -[:BORN_IN]-> (Idaho)

当图 2-5 的所有顶点和边都添加到数据库中后,我们就可以开始提出有趣的问题了:例如,查找所有从美国移民到欧洲的人的姓名。更准确地说,这里我们想要找到所有这样的顶点:它们有一条指向美国境内某个位置的 BORN_IN 边,还有一条指向欧洲境内某个位置的 LIVES_IN 边,并返回每个这样的顶点的 name 属性。

When all the vertices and edges of Figure 2-5 are added to the database, we can start asking interesting questions: for example, find the names of all the people who emigrated from the United States to Europe. To be more precise, here we want to find all the vertices that have a BORN_IN edge to a location within the US, and also a LIVES_IN edge to a location within Europe, and return the name property of each of those vertices.

示例 2-4 显示了如何在 Cypher 中表达该查询。在 MATCH 子句中使用相同的箭头表示法来查找图中的模式:(person) -[:BORN_IN]-> () 匹配由一条标记为 BORN_IN 的边关联的任意两个顶点。该边的尾顶点绑定到变量 person,而头顶点未命名。

Example 2-4 shows how to express that query in Cypher. The same arrow notation is used in a MATCH clause to find patterns in the graph: (person) -[:BORN_IN]-> () matches any two vertices that are related by an edge labeled BORN_IN. The tail vertex of that edge is bound to the variable person, and the head vertex is left unnamed.

示例 2-4。使用 Cypher 查询查找从美国移民到欧洲的人
MATCH
  (person) -[:BORN_IN]->  () -[:WITHIN*0..]-> (us:Location {name:'United States'}),
  (person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (eu:Location {name:'Europe'})
RETURN person.name

该查询可以解读如下:

The query can be read as follows:

找到满足以下两个条件的任何顶点(称为 person):

  1. person 有一条指向某个顶点的 BORN_IN 出边。从该顶点开始,您可以沿着一系列 WITHIN 出边,直到最终到达类型为 Location 的顶点,其 name 属性等于 "United States"。

  2. 同一个 person 顶点还有一条 LIVES_IN 出边。沿着该边,再沿着一系列 WITHIN 出边,您最终会到达类型为 Location 的顶点,其 name 属性等于 "Europe"。

对于每个这样的 person 顶点,返回其 name 属性。

Find any vertex (call it person) that meets both of the following conditions:

  1. person has an outgoing BORN_IN edge to some vertex. From that vertex, you can follow a chain of outgoing WITHIN edges until eventually you reach a vertex of type Location, whose name property is equal to "United States".

  2. That same person vertex also has an outgoing LIVES_IN edge. Following that edge, and then a chain of outgoing WITHIN edges, you eventually reach a vertex of type Location, whose name property is equal to "Europe".

For each such person vertex, return the name property.

有多种可能的执行查询的方法。这里给出的描述建议您首先扫描数据库中的所有人员,检查每个人的出生地和居住地,然后仅返回符合条件的人员。

There are several possible ways of executing the query. The description given here suggests that you start by scanning all the people in the database, examine each person’s birthplace and residence, and return only those people who meet the criteria.

但同样地,您也可以从两个 Location 顶点开始,反向进行。如果 name 属性上有索引,您或许可以高效地找到代表美国和欧洲的那两个顶点。然后,您可以沿着所有传入的 WITHIN 边,分别找出美国和欧洲境内的所有位置(州、大区、城市等)。最后,您可以在这些位置顶点中,寻找可以通过传入的 BORN_IN 或 LIVES_IN 边找到的人。

But equivalently, you could start with the two Location vertices and work backward. If there is an index on the name property, you can probably efficiently find the two vertices representing the US and Europe. Then you can proceed to find all locations (states, regions, cities, etc.) in the US and Europe respectively by following all incoming WITHIN edges. Finally, you can look for people who can be found through an incoming BORN_IN or LIVES_IN edge at one of the location vertices.
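The backward strategy can be sketched over an edge list shaped like Example 2-2; the data and helper below are invented, and the breadth-first walk stands in for what a real query planner would do with indexes:

```python
from collections import deque

# Edges as (tail, label, head) triples; sample data invented for illustration.
edges = [
    ("idaho", "within", "usa"),
    ("boise", "within", "idaho"),
    ("lucy",  "born_in", "boise"),
    ("alain", "born_in", "beaune"),
]

def within_set(root):
    """All locations inside `root`, found by walking incoming `within`
    edges backward (root itself included: zero or more hops)."""
    found, queue = {root}, deque([root])
    while queue:
        loc = queue.popleft()
        for tail, label, head in edges:
            if label == "within" and head == loc and tail not in found:
                found.add(tail)
                queue.append(tail)
    return found

in_usa = within_set("usa")
born_in_usa = {t for t, l, h in edges if l == "born_in" and h in in_usa}
print(born_in_usa)  # {'lucy'}
```

Intersecting this set with the analogous set for people living in Europe would complete the query, mirroring the execution plan described above.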

正如声明式查询语言的典型情况一样,编写查询时您不需要指定此类执行细节:查询优化器会自动选择预计最高效的策略,因此您可以继续编写应用程序的其余部分。

As is typical for a declarative query language, you don’t need to specify such execution details when writing the query: the query optimizer automatically chooses the strategy that is predicted to be the most efficient, so you can get on with writing the rest of your application.

SQL 中的图形查询

Graph Queries in SQL

示例 2-2表明图形数据可以在关系数据库中表示。但是如果我们把图数据放在关系结构中,我们是否也可以使用SQL来查询它呢?

Example 2-2 suggested that graph data can be represented in a relational database. But if we put graph data in a relational structure, can we also query it using SQL?

答案是肯定的,但有一定难度。在关系数据库中,您通常预先知道查询中需要哪些联接。在图查询中,在找到要查找的顶点之前,您可能需要遍历可变数量的边,也就是说,连接的数量并不是预先固定的。

The answer is yes, but with some difficulty. In a relational database, you usually know in advance which joins you need in your query. In a graph query, you may need to traverse a variable number of edges before you find the vertex you’re looking for—that is, the number of joins is not fixed in advance.

在我们的示例中,这发生在 Cypher 查询的 () -[:WITHIN*0..]-> () 规则中。一个人的 LIVES_IN 边可以指向任何类型的位置:街道、城市、区、大区、州等。一个城市可以 WITHIN(位于)某个大区,大区 WITHIN 某个州,州 WITHIN 某个国家,等等。LIVES_IN 边可能直接指向您要查找的位置顶点,也可能在位置层次结构中相隔好几个层级。

In our example, that happens in the () -[:WITHIN*0..]-> () rule in the Cypher query. A person’s LIVES_IN edge may point at any kind of location: a street, a city, a district, a region, a state, etc. A city may be WITHIN a region, a region WITHIN a state, a state WITHIN a country, etc. The LIVES_IN edge may point directly at the location vertex you’re looking for, or it may be several levels removed in the location hierarchy.

在 Cypher 中,:WITHIN*0.. 非常简洁地表达了这一事实:它的意思是"沿着一条 WITHIN 边,零次或多次"。它就像正则表达式中的 * 运算符。

In Cypher, :WITHIN*0.. expresses that fact very concisely: it means “follow a WITHIN edge, zero or more times.” It is like the * operator in a regular expression.

从 SQL:1999 开始,查询中的可变长度遍历路径的想法可以使用称为递归公用表表达式WITH RECURSIVE语法)的东西来表达。 示例 2-5显示了使用此技术(在 PostgreSQL、IBM DB2、Oracle 和 SQL Server 中支持)的 SQL 中表达的相同查询(查找从美国移民到欧洲的人的姓名)。然而,与 Cypher 相比,其语法非常笨拙。

Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using something called recursive common table expressions (the WITH RECURSIVE syntax). Example 2-5 shows the same query—finding the names of people who emigrated from the US to Europe—expressed in SQL using this technique (supported in PostgreSQL, IBM DB2, Oracle, and SQL Server). However, the syntax is very clumsy in comparison to Cypher.

示例 2-5。与示例 2-4相同的查询,使用递归公用表表达式以 SQL 表示
WITH RECURSIVE

  -- in_usa is the set of vertex IDs of all locations within the United States
  in_usa(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties->>'name' = 'United States' 1
    UNION
      SELECT edges.tail_vertex FROM edges 2
        JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
        WHERE edges.label = 'within'
  ),

  -- in_europe is the set of vertex IDs of all locations within Europe
  in_europe(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties->>'name' = 'Europe' 3
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
        WHERE edges.label = 'within'
  ),

  -- born_in_usa is the set of vertex IDs of all people born in the US
  born_in_usa(vertex_id) AS ( 4
    SELECT edges.tail_vertex FROM edges
      JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
      WHERE edges.label = 'born_in'
  ),

  -- lives_in_europe is the set of vertex IDs of all people living in Europe
  lives_in_europe(vertex_id) AS ( 5
    SELECT edges.tail_vertex FROM edges
      JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
      WHERE edges.label = 'lives_in'
  )

SELECT vertices.properties->>'name'
FROM vertices
-- join to find those people who were both born in the US *and* live in Europe
JOIN born_in_usa     ON vertices.vertex_id = born_in_usa.vertex_id 6
JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id;
1

首先找到 name 属性值为 "United States" 的顶点,并将其作为顶点集合 in_usa 的第一个元素。

First find the vertex whose name property has the value "United States", and make it the first element of the set of vertices in_usa.

2

沿着集合 in_usa 中顶点的所有传入 within 边,并将它们添加到同一集合中,直到访问完所有传入的 within 边。

Follow all incoming within edges from vertices in the set in_usa, and add them to the same set, until all incoming within edges have been visited.

3

从 name 属性值为 "Europe" 的顶点开始,执行相同的操作,构建出顶点集 in_europe。

Do the same starting with the vertex whose name property has the value "Europe", and build up the set of vertices in_europe.

4

对于集合 in_usa 中的每个顶点,沿着传入的 born_in 边,找到出生在美国境内某个地方的人。

For each of the vertices in the set in_usa, follow incoming born_in edges to find people who were born in some place within the United States.

5

同样,对于集合 in_europe 中的每个顶点,沿着传入的 lives_in 边,找到居住在欧洲的人。

Similarly, for each of the vertices in the set in_europe, follow incoming lives_in edges to find people who live in Europe.

6

最后,通过连接(join),将在美国出生的人的集合与居住在欧洲的人的集合求交集。

Finally, intersect the set of people born in the USA with the set of people living in Europe, by joining them.

如果相同的查询可以用一种查询语言编写 4 行,但用另一种查询语言需要 29 行,这就表明不同的数据模型是为了满足不同的用例而设计的。选择适合您的应用程序的数据模型非常重要。

If the same query can be written in 4 lines in one query language but requires 29 lines in another, that just shows that different data models are designed to satisfy different use cases. It’s important to pick a data model that is suitable for your application.

三重存储和 SPARQL

Triple-Stores and SPARQL

三元存储模型基本上等同于属性图模型,使用不同的词语来描述相同的想法。尽管如此,它还是值得讨论的,因为有多种用于三重存储的工具和语言,可以为您构建应用程序的工具箱提供有价值的补充。

The triple-store model is mostly equivalent to the property graph model, using different words to describe the same ideas. It is nevertheless worth discussing, because there are various tools and languages for triple-stores that can be valuable additions to your toolbox for building applications.

在三元组存储中,所有信息都以非常简单的三部分语句的形式存储:(主语谓语宾语)。例如,在三元组(Jimlikesbananas)中,Jim是主语,likes是谓语(动词),bananas是宾语。

In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object.

三元组的主语相当于图中的一个顶点。宾语则是以下两种事物之一:

The subject of a triple is equivalent to a vertex in a graph. The object is one of two things:

  1. 原始数据类型中的值,例如字符串或数字。在这种情况下,三元组的谓语和宾语相当于主语顶点上某个属性的键和值。例如,(lucy, age, 33) 就像一个具有属性 {"age":33} 的顶点 lucy。

  2. A value in a primitive datatype, such as a string or a number. In that case, the predicate and object of the triple are equivalent to the key and value of a property on the subject vertex. For example, (lucy, age, 33) is like a vertex lucy with properties {"age":33}.

  3. 图中的另一个顶点。在这种情况下,谓语是图中的一条边,主语是尾顶点,宾语是头顶点。例如,在 (lucy, marriedTo, alain) 中,主语 lucy 和宾语 alain 都是顶点,谓语 marriedTo 是连接它们的边的标签。

  4. Another vertex in the graph. In that case, the predicate is an edge in the graph, the subject is the tail vertex, and the object is the head vertex. For example, in (lucy, marriedTo, alain) the subject and object lucy and alain are both vertices, and the predicate marriedTo is the label of the edge that connects them.
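The two cases can be sketched with plain Python tuples; this toy triple-store and its helper names are made up for illustration:

```python
# A toy triple-store: every fact is a (subject, predicate, object) tuple.
# Objects that are themselves subjects act as edges; other values act as
# properties. All data here is invented for illustration.
triples = [
    ("lucy", "age", 33),             # property: key "age", value 33
    ("lucy", "marriedTo", "alain"),  # edge: lucy -marriedTo-> alain
    ("alain", "age", 35),
]

subjects = {s for s, p, o in triples}

def properties(subject):
    """Predicate/object pairs whose object is a primitive value."""
    return {p: o for s, p, o in triples
            if s == subject and o not in subjects}

def edges_from(subject):
    """Predicate/object pairs whose object is itself a vertex."""
    return {p: o for s, p, o in triples
            if s == subject and o in subjects}

print(properties("lucy"))  # {'age': 33}
print(edges_from("lucy"))  # {'marriedTo': 'alain'}
```

A real triple-store distinguishes literals from vertex references in its storage format rather than by membership tests, but the split into properties and edges is the same idea.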

示例 2-6显示了与示例 2-3中相同的数据,以称为Turtle的格式编写为三元组,这是Notation3 ( N3 ) [ 39 ]的子集。

Example 2-6 shows the same data as in Example 2-3, written as triples in a format called Turtle, a subset of Notation3 (N3) [39].

示例 2-6。图 2-5中的数据子集,表示为 Turtle 三元组
@prefix : <urn:example:>.
_:lucy     a       :Person.
_:lucy     :name   "Lucy".
_:lucy     :bornIn _:idaho.
_:idaho    a       :Location.
_:idaho    :name   "Idaho".
_:idaho    :type   "state".
_:idaho    :within _:usa.
_:usa      a       :Location.
_:usa      :name   "United States".
_:usa      :type   "country".
_:usa      :within _:namerica.
_:namerica a       :Location.
_:namerica :name   "North America".
_:namerica :type   "continent".

在此示例中,图的顶点被写为 _:someName。该名称在该文件之外没有任何含义;它存在只是因为否则我们不知道哪些三元组引用同一个顶点。当谓语表示一条边时,宾语是一个顶点,如 _:idaho :within _:usa。当谓语是属性时,宾语是字符串字面量,如 _:usa :name "United States"。

In this example, vertices of the graph are written as _:someName. The name doesn’t mean anything outside of this file; it exists only because we otherwise wouldn’t know which triples refer to the same vertex. When the predicate represents an edge, the object is a vertex, as in _:idaho :within _:usa. When the predicate is a property, the object is a string literal, as in _:usa :name "United States".

反复重复同一主语相当啰嗦,但幸运的是,您可以使用分号来对同一主语陈述多件事情。这使得 Turtle 格式相当简洁易读:参见示例 2-7。

It’s quite repetitive to repeat the same subject over and over again, but fortunately you can use semicolons to say multiple things about the same subject. This makes the Turtle format quite nice and readable: see Example 2-7.

示例 2-7。例2-6中更简洁的数据写入方式
@prefix : <urn:example:>.
_:lucy     a :Person;   :name "Lucy";          :bornIn _:idaho.
_:idaho    a :Location; :name "Idaho";         :type "state";   :within _:usa.
_:usa      a :Location; :name "United States"; :type "country"; :within _:namerica.
_:namerica a :Location; :name "North America"; :type "continent".

语义网

The semantic web

如果您阅读更多有关三重存储的内容,您可能会陷入有关语义网络的文章漩涡中。三元存储数据模型完全独立于语义网——例如,Datomic [ 40 ] 就是一个三元存储,但并不声称与其有任何关系。vii 但由于两者在许多人的心目中联系如此紧密,我们应该简单地讨论一下。

If you read more about triple-stores, you may get sucked into a maelstrom of articles written about the semantic web. The triple-store data model is completely independent of the semantic web—for example, Datomic [40] is a triple-store that does not claim to have anything to do with it.vii But since the two are so closely linked in many people’s minds, we should discuss them briefly.

语义网从根本上来说是一个简单而合理的想法:网站已经以文本和图片的形式发布信息供人类阅读,那么为什么它们不也以机器可读的数据形式发布信息供计算机阅读呢?资源描述框架(RDF)[41] 旨在作为一种机制,让不同网站以一致的格式发布数据,从而使来自不同网站的数据能够自动组合成一个数据网络——一种互联网范围的"万物数据库"。

The semantic web is fundamentally a simple and reasonable idea: websites already publish information as text and pictures for humans to read, so why don’t they also publish information as machine-readable data for computers to read? The Resource Description Framework (RDF) [41] was intended as a mechanism for different websites to publish data in a consistent format, allowing data from different websites to be automatically combined into a web of data—a kind of internet-wide “database of everything.”

不幸的是,语义网在2000年代初被过度炒作,但迄今为止尚未显示出任何在实践中实现的迹象,这让许多人对此感到愤世嫉俗。它还受到令人眼花缭乱的缩略词过多、过于复杂的标准提案和傲慢的困扰。

Unfortunately, the semantic web was overhyped in the early 2000s but so far hasn’t shown any sign of being realized in practice, which has made many people cynical about it. It has also suffered from a dizzying plethora of acronyms, overly complex standards proposals, and hubris.

然而,如果你回顾这些失败,就会发现语义 Web 项目也做出了很多出色的工作。即使您对在语义 Web 上发布 RDF 数据不感兴趣,三元组也可以成为应用程序的良好内部数据模型。

However, if you look past those failings, there is also a lot of good work that has come out of the semantic web project. Triples can be a good internal data model for applications, even if you have no interest in publishing RDF data on the semantic web.

RDF 数据模型

The RDF data model

我们在示例 2-7 中使用的 Turtle 语言是一种人类可读的 RDF 数据格式。有时,RDF 也以 XML 格式编写,这会更加冗长地完成同样的事情 — 请参阅 示例 2-8。Turtle/N3 更可取,因为它更容易看清,并且 Apache Jena [ 42 ] 等工具可以在必要时自动在不同的 RDF 格式之间进行转换。

The Turtle language we used in Example 2-7 is a human-readable format for RDF data. Sometimes RDF is also written in an XML format, which does the same thing much more verbosely—see Example 2-8. Turtle/N3 is preferable as it is much easier on the eyes, and tools like Apache Jena [42] can automatically convert between different RDF formats if necessary.

示例 2-8。例2-7的数据,使用RDF/XML语法表示
<rdf:RDF xmlns="urn:example:"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <Location rdf:nodeID="idaho">
    <name>Idaho</name>
    <type>state</type>
    <within>
      <Location rdf:nodeID="usa">
        <name>United States</name>
        <type>country</type>
        <within>
          <Location rdf:nodeID="namerica">
            <name>North America</name>
            <type>continent</type>
          </Location>
        </within>
      </Location>
    </within>
  </Location>

  <Person rdf:nodeID="lucy">
    <name>Lucy</name>
    <bornIn rdf:nodeID="idaho"/>
  </Person>
</rdf:RDF>

RDF 有一些怪癖,因为它是为互联网范围的数据交换而设计的。三元组的主语、谓语和宾语通常是 URI。例如,谓语可能是 <http://my-company.com/namespace#within> 或 <http://my-company.com/namespace#lives_in> 这样的 URI,而不仅仅是 WITHIN 或 LIVES_IN。这种设计背后的理由是,您应该能够将自己的数据与其他人的数据组合起来;如果他们对 within 或 lives_in 这个词赋予了不同的含义,也不会产生冲突,因为他们的谓语实际上是 <http://other.org/foo#within> 和 <http://other.org/foo#lives_in>。

RDF has a few quirks due to the fact that it is designed for internet-wide data exchange. The subject, predicate, and object of a triple are often URIs. For example, a predicate might be an URI such as <http://my-company.com/namespace#within> or <http://my-company.com/namespace#lives_in>, rather than just WITHIN or LIVES_IN. The reasoning behind this design is that you should be able to combine your data with someone else’s data, and if they attach a different meaning to the word within or lives_in, you won’t get a conflict because their predicates are actually <http://other.org/foo#within> and <http://other.org/foo#lives_in>.

URL <http://my-company.com/namespace> 不一定需要解析为任何内容——从 RDF 的角度来看,它只是一个命名空间。为了避免与 http:// URL 潜在的混淆,本节中的示例使用不可解析的 URI,例如 urn:example:within。幸运的是,您只需在文件顶部指定一次此前缀,之后就可以忘掉它了。

The URL <http://my-company.com/namespace> doesn’t necessarily need to resolve to anything—from RDF’s point of view, it is simply a namespace. To avoid potential confusion with http:// URLs, the examples in this section use non-resolvable URIs such as urn:example:within. Fortunately, you can just specify this prefix once at the top of the file, and then forget about it.

SPARQL 查询语言

The SPARQL query language

SPARQL是一种使用 RDF 数据模型的三元组存储查询语言[ 43 ]。(它是SPARQL Protocol 和 RDF Query Language的缩写,发音为“sparkle”。)它早于 Cypher,并且由于 Cypher 的模式匹配是从 SPARQL 借用的,因此它们看起来非常相似 [ 37 ]。

SPARQL is a query language for triple-stores using the RDF data model [43]. (It is an acronym for SPARQL Protocol and RDF Query Language, pronounced “sparkle.”) It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite similar [37].

与之前相同的查询(查找从美国搬到欧洲的人)在 SPARQL 中比在 Cypher 中更加简洁(参见示例 2-9)。

The same query as before—finding people who have moved from the US to Europe—is even more concise in SPARQL than it is in Cypher (see Example 2-9).

例2-9。与示例 2-4相同的查询,以 SPARQL 表示
PREFIX : <urn:example:>

SELECT ?personName WHERE {
  ?person :name ?personName.
  ?person :bornIn  / :within* / :name "United States".
  ?person :livesIn / :within* / :name "Europe".
}

结构非常相似。以下两个表达式是等效的(变量在 SPARQL 中以问号开头):

The structure is very similar. The following two expressions are equivalent (variables start with a question mark in SPARQL):

(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (location)   # Cypher

?person :bornIn / :within* ?location.                   # SPARQL

由于 RDF 不区分属性和边,而是对两者都使用谓语,因此您可以使用相同的语法来匹配属性。在下面的表达式中,变量 usa 绑定到任何具有 name 属性且其值为字符串 "United States" 的顶点:

Because RDF doesn’t distinguish between properties and edges but just uses predicates for both, you can use the same syntax for matching properties. In the following expression, the variable usa is bound to any vertex that has a name property whose value is the string "United States":

(usa {name:'United States'})   # Cypher

?usa :name "United States".    # SPARQL

SPARQL 是一种很好的查询语言——即使语义网从未出现,它也可以成为应用程序内部使用的强大工具。

SPARQL is a nice query language—even if the semantic web never happens, it can be a powerful tool for applications to use internally.

基础:Datalog

The Foundation: Datalog

Datalog一种比 SPARQL 或 Cypher 更古老的语言 ,在20世纪 80 年代已被学者广泛研究 [ 44,45,46 ]。它在软件工程师中不太为人所知,但它仍然很重要,因为它为后来的查询语言构建提供了基础。

Datalog is a much older language than SPARQL or Cypher, having been studied extensively by academics in the 1980s [44, 45, 46]. It is less well known among software engineers, but it is nevertheless important, because it provides the foundation that later query languages build upon.

在实践中,Datalog被用在一些数据系统中:例如,它是Datomic[ 40 ] 的查询语言,而Cascalog[ 47 ]是在Hadoop中查询大型数据集的Datalog实现。

In practice, Datalog is used in a few data systems: for example, it is the query language of Datomic [40], and Cascalog [47] is a Datalog implementation for querying large datasets in Hadoop.viii

Datalog的数据模型类似于三重存储模型,稍微概括一下。我们不将三元组写为(主语谓语宾语),而是将其写为谓语主语宾语)。 示例 2-10显示了如何将示例中的数据写入 Datalog。

Datalog’s data model is similar to the triple-store model, generalized a bit. Instead of writing a triple as (subject, predicate, object), we write it as predicate(subject, object). Example 2-10 shows how to write the data from our example in Datalog.

示例 2-10。图 2-5 中数据的子集,表示为 Datalog 事实
name(namerica, 'North America').
type(namerica, continent).

name(usa, 'United States').
type(usa, country).
within(usa, namerica).

name(idaho, 'Idaho').
type(idaho, state).
within(idaho, usa).

name(lucy, 'Lucy').
born_in(lucy, idaho).
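The notational shift above, from (subject, predicate, object) triples to predicate(subject, object) facts, is purely mechanical. A hedged sketch, with an invented vertex set used to decide which objects are identifiers and which are quoted literals:

```python
# Convert (subject, predicate, object) triples into Datalog-style
# predicate(subject, object) fact strings. Data invented for illustration.
triples = [
    ("idaho", "name", "Idaho"),
    ("idaho", "within", "usa"),
]

# Objects naming known vertices stay bare; other values are quoted literals.
vertex_ids = {"idaho", "usa"}

def to_fact(subject, predicate, obj):
    rendered = obj if obj in vertex_ids else f"'{obj}'"
    return f"{predicate}({subject}, {rendered})."

facts = [to_fact(*t) for t in triples]
print(facts)  # ["name(idaho, 'Idaho').", "within(idaho, usa)."]
```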

现在我们已经定义了数据,我们可以编写与之前相同的查询,如 示例 2-11所示。它看起来与 Cypher 或 SPARQL 中的等效项有点不同,但不要因此而失望。Datalog 是 Prolog 的子集,如果您学过计算机科学,您可能以前见过 Prolog。

Now that we have defined the data, we can write the same query as before, as shown in Example 2-11. It looks a bit different from the equivalent in Cypher or SPARQL, but don’t let that put you off. Datalog is a subset of Prolog, which you might have seen before if you’ve studied computer science.

示例 2-11。与示例 2-4相同的查询,以 Datalog 表示
within_recursive(Location, Name) :- name(Location, Name).     /* Rule 1 */

within_recursive(Location, Name) :- within(Location, Via),    /* Rule 2 */
                                    within_recursive(Via, Name).

migrated(Name, BornIn, LivingIn) :- name(Person, Name),       /* Rule 3 */
                                    born_in(Person, BornLoc),
                                    within_recursive(BornLoc, BornIn),
                                    lives_in(Person, LivingLoc),
                                    within_recursive(LivingLoc, LivingIn).

?- migrated(Who, 'United States', 'Europe').
/* Who = 'Lucy'. */

Cypher 和 SPARQL 一上来就直接使用 SELECT,但 Datalog 则是一次只走一小步。我们定义规则来告诉数据库新的谓词:在这里,我们定义了两个新谓词,within_recursive 和 migrated。这些谓词不是存储在数据库中的三元组,而是从数据或其他规则派生出来的。规则可以引用其他规则,就像函数可以调用其他函数或递归调用自身一样。像这样,复杂的查询可以一小块一小块地构建起来。

Cypher and SPARQL jump in right away with SELECT, but Datalog takes a small step at a time. We define rules that tell the database about new predicates: here, we define two new predicates, within_recursive and migrated. These predicates aren’t triples stored in the database, but instead they are derived from data or from other rules. Rules can refer to other rules, just like functions can call other functions or recursively call themselves. Like this, complex queries can be built up a small piece at a time.

在规则中,以大写字母开头的单词是变量,谓词的匹配方式类似于 Cypher 和 SPARQL。例如,name(Location, Name) 与三元组 name(namerica, 'North America') 匹配,变量绑定为 Location = namerica 和 Name = 'North America'。

In rules, words that start with an uppercase letter are variables, and predicates are matched like in Cypher and SPARQL. For example, name(Location, Name) matches the triple name(namerica, 'North America') with variable bindings Location = namerica and Name = 'North America'.

如果系统可以为 :- 运算符右侧的所有谓词找到匹配项,则规则适用。当规则适用时,就好像 :- 的左侧被添加到了数据库中(变量被它们匹配到的值替换)。

A rule applies if the system can find a match for all predicates on the righthand side of the :- operator. When the rule applies, it’s as though the lefthand side of the :- was added to the database (with variables replaced by the values they matched).

应用规则的一种可能的方式是:

One possible way of applying the rules is thus:

  1. name(namerica, 'North America') 存在于数据库中,因此规则 1 适用。它生成 within_recursive(namerica, 'North America')。

  1. name(namerica, 'North America') exists in the database, so rule 1 applies. It generates within_recursive(namerica, 'North America').

  2. within(usa, namerica) 存在于数据库中,且上一步生成了 within_recursive(namerica, 'North America'),因此规则 2 适用。它生成 within_recursive(usa, 'North America')。

  2. within(usa, namerica) exists in the database and the previous step generated within_recursive(namerica, 'North America'), so rule 2 applies. It generates within_recursive(usa, 'North America').

  3. within(idaho, usa) 存在于数据库中,且上一步生成了 within_recursive(usa, 'North America'),因此规则 2 适用。它生成 within_recursive(idaho, 'North America')。

  3. within(idaho, usa) exists in the database and the previous step generated within_recursive(usa, 'North America'), so rule 2 applies. It generates within_recursive(idaho, 'North America').

通过重复应用规则 1 和 2,within_recursive谓词可以告诉我们数据库中包含的北美所有位置(或任何其他位置名称)。该过程如图2-6所示。

By repeated application of rules 1 and 2, the within_recursive predicate can tell us all the locations in North America (or any other location name) contained in our database. This process is illustrated in Figure 2-6.

图 2-6。使用示例 2-11 中的 Datalog 规则确定爱达荷州位于北美。
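The stepwise derivation above can be sketched as a naive bottom-up (forward-chaining) evaluation. The following Python sketch is purely illustrative — real Datalog engines use much more sophisticated evaluation strategies — and the fact names mirror the book's example:

```python
# Naive bottom-up evaluation of within_recursive, mirroring rules 1 and 2.
# Base facts from the example: name(location, name) and within(location, parent).
name_facts = {("namerica", "North America"), ("usa", "United States"),
              ("idaho", "Idaho")}
within_facts = {("usa", "namerica"), ("idaho", "usa")}

# Rule 1: within_recursive(Location, Name) :- name(Location, Name).
derived = set(name_facts)

# Rule 2: within_recursive(Location, Name) :- within(Location, Via),
#                                             within_recursive(Via, Name).
# Keep applying the rule until no new facts appear (a fixed point).
changed = True
while changed:
    changed = False
    for loc, via in within_facts:
        for via2, nm in list(derived):
            if via == via2 and (loc, nm) not in derived:
                derived.add((loc, nm))
                changed = True

print(("idaho", "North America") in derived)  # True
```

Repeatedly scanning all facts until nothing changes is exactly the "repeated application of rules 1 and 2" described above, just made explicit as a loop.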

现在,规则 3 可以找到出生在某个地点 BornIn 并居住在某个地点 LivingIn 的人。通过以 BornIn = 'United States' 和 LivingIn = 'Europe' 进行查询,并把人留作变量 Who,我们要求 Datalog 系统找出变量 Who 可以取哪些值。因此,最终我们得到了与之前的 Cypher 和 SPARQL 查询相同的答案。

Now rule 3 can find people who were born in some location BornIn and live in some location LivingIn. By querying with BornIn = 'United States' and LivingIn = 'Europe', and leaving the person as a variable Who, we ask the Datalog system to find out which values can appear for the variable Who. So, finally we get the same answer as in the earlier Cypher and SPARQL queries.

Datalog 方法需要一种与本章讨论的其他查询语言不同的思维方式,但它是一种非常强大的方法,因为规则可以在不同的查询中组合和重用。它对于简单的一次性查询不太方便,但在数据很复杂时能够更好地应对。

The Datalog approach requires a different kind of thinking to the other query languages discussed in this chapter, but it’s a very powerful approach, because rules can be combined and reused in different queries. It’s less convenient for simple one-off queries, but it can cope better if your data is complex.

总结

Summary

数据模型是一个庞大的主题,在本章中我们快速浏览了各种不同的模型。我们没有足够的空间来详细介绍每个模型的所有细节,但希望概述足以激发您的兴趣,以了解有关最适合您的应用程序要求的模型的更多信息。

Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of different models. We didn’t have space to go into all the details of each model, but hopefully the overview has been enough to whet your appetite to find out more about the model that best fits your application’s requirements.

从历史上看,数据最初被表示为一棵大树(层次模型),但这并不适合表示多对多关系,因此发明了关系模型来解决这个问题。最近,开发人员发现某些应用程序也不太适合关系模型。新的非关系型“NoSQL”数据存储有两个主要方向:

Historically, data started out being represented as one big tree (the hierarchical model), but that wasn’t good for representing many-to-many relationships, so the relational model was invented to solve that problem. More recently, developers found that some applications don’t fit well in the relational model either. New nonrelational “NoSQL” datastores have diverged in two main directions:

  1. 文档数据库的目标用例是数据来自自包含的文档,并且一个文档与另一个文档之间的关系很少。

  1. Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.

  2. 图形数据库则朝相反的方向发展,其目标是任何事物都可能与所有事物相关的用例。

  2. Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.

所有这三种模型(文档、关系和图形)如今都被广泛使用,并且每种模型在各自的领域中都表现出色。一个模型可以用另一种模型来模拟——例如,图形数据可以用关系数据库来表示——但结果往往很尴尬。这就是为什么我们针对不同的目的采用不同的系统,而不是一种万能的解决方案。

All three models (document, relational, and graph) are widely used today, and each is good in its respective domain. One model can be emulated in terms of another model—for example, graph data can be represented in a relational database—but the result is often awkward. That’s why we have different systems for different purposes, not a single one-size-fits-all solution.

文档数据库和图形数据库的一个共同点是,它们通常不会对所存储的数据强制实施模式,这可以使应用程序更容易适应不断变化的需求。然而,您的应用程序很可能仍然假设数据具有某种结构;区别只在于模式是显式的(在写入时强制)还是隐式的(在读取时处理)。

One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements. However, your application most likely still assumes that data has a certain structure; it’s just a question of whether the schema is explicit (enforced on write) or implicit (handled on read).

每个数据模型都有自己的查询语言或框架,我们讨论了几个示例:SQL、MapReduce、MongoDB 的聚合管道、Cypher、SPARQL 和 Datalog。我们还讨论了 CSS 和 XSL/XPath,它们不是数据库查询语言,但有有趣的相似之处。

Each data model comes with its own query language or framework, and we discussed several examples: SQL, MapReduce, MongoDB’s aggregation pipeline, Cypher, SPARQL, and Datalog. We also touched on CSS and XSL/XPath, which aren’t database query languages but have interesting parallels.

尽管我们已经介绍了很多内容,但仍有许多数据模型没有提及。举几个简单的例子:

Although we have covered a lot of ground, there are still many data models left unmentioned. To give just a few brief examples:

  • 研究基因组数据的研究人员经常需要执行序列相似性搜索,这意味着获取一个很长的字符串(代表 DNA 分子)并将其与一个大型数据库中相似但不相同的字符串进行匹配。这里描述的数据库都不能处理这种用法,这就是为什么研究人员编写了专门的基因组数据库软件,如 GenBank [ 48 ]。

  • Researchers working with genome data often need to perform sequence-similarity searches, which means taking one very long string (representing a DNA molecule) and matching it against a large database of strings that are similar, but not identical. None of the databases described here can handle this kind of usage, which is why researchers have written specialized genome database software like GenBank [48].

  • 几十年来,粒子物理学家一直在进行大数据式的大规模数据分析,像大型强子对撞机 (LHC) 这样的项目现在可以处理数百 PB 的数据!在如此规模下,需要定制解决方案来阻止硬件成本失控[ 49 ]。

  • Particle physicists have been doing Big Data–style large-scale data analysis for decades, and projects like the Large Hadron Collider (LHC) now work with hundreds of petabytes! At such a scale custom solutions are required to stop the hardware cost from spiraling out of control [49].

  • 全文搜索可以说是一种经常与数据库一起使用的数据模型。信息检索是一个很大的专业主题,我们不会在本书中详细讨论,但我们将在第3 章和第 III 部分中讨论搜索索引。

  • Full-text search is arguably a kind of data model that is frequently used alongside databases. Information retrieval is a large specialist subject that we won’t cover in great detail in this book, but we’ll touch on search indexes in Chapter 3 and Part III.

本章就先讲到这里。在下一章中,我们将讨论在实现本章所描述的数据模型时会遇到的一些权衡。

We have to leave it there for now. In the next chapter we will discuss some of the trade-offs that come into play when implementing the data models described in this chapter.

脚注

i借用自电子学的术语。每个电路的输入和输出都有一定的阻抗(对交流电的电阻)。当您将一个电路的输出连接到另一个电路的输入时,如果两个电路的输出和输入阻抗匹配,则通过连接的功率传输将最大化。阻抗不匹配可能导致信号反射和其他问题。

i A term borrowed from electronics. Every electric circuit has a certain impedance (resistance to alternating current) on its inputs and outputs. When you connect one circuit’s output to another one’s input, the power transfer across the connection is maximized if the output and input impedances of the two circuits match. An impedance mismatch can lead to signal reflections and other troubles.

ii 有关关系模型的文献区分了几种不同的范式,但这些区别没有什么实际意义。根据经验,如果您复制了本可以只存储在一个位置的值,那么该模式就没有规范化。

ii Literature on the relational model distinguishes several different normal forms, but the distinctions are of little practical interest. As a rule of thumb, if you’re duplicating values that could be stored in just one place, the schema is not normalized.

iii 在撰写本文时,RethinkDB 支持连接,MongoDB 不支持连接,而 CouchDB 仅在预先声明的视图中支持连接。

iii At the time of writing, joins are supported in RethinkDB, not supported in MongoDB, and only supported in predeclared views in CouchDB.

iv外键约束允许您限制修改,但关系模型不需要此类约束。即使有约束,外键上的联接也是在查询时执行的,而在 CODASYL 中,联接是在插入时有效完成的。

iv Foreign key constraints allow you to restrict modifications, but such constraints are not required by the relational model. Even with constraints, joins on foreign keys are performed at query time, whereas in CODASYL, the join was effectively done at insert time.

v Codd 对关系模型 [ 1 ] 的原始描述实际上允许在关系模式中执行与 JSON 文档非常相似的操作。他称之为非简单域。这个想法是,行中的值不必只是数字或字符串等原始数据类型,还可以是嵌套关系(表),因此您可以将任意嵌套的树结构作为值,就像 30 多年后添加到 SQL 中的 JSON 或 XML 支持一样。

v Codd’s original description of the relational model [1] actually allowed something quite similar to JSON documents within a relational schema. He called it nonsimple domains. The idea was that a value in a row doesn’t have to just be a primitive datatype like a number or a string, but could also be a nested relation (table)—so you can have an arbitrarily nested tree structure as a value, much like the JSON or XML support that was added to SQL over 30 years later.

vi IMS 和 CODASYL 都使用命令式查询 API。应用程序通常使用 COBOL 代码来迭代数据库中的记录,一次一条记录 [ 2 , 16 ]。

vi IMS and CODASYL both used imperative query APIs. Applications typically used COBOL code to iterate over records in the database, one record at a time [2, 16].

vii从技术上讲,Datomic 使用 5 元组而不是三元组;这两个附加字段是用于版本控制的元数据。

vii Technically, Datomic uses 5-tuples rather than triples; the two additional fields are metadata for versioning.

viii Datomic 和 Cascalog 使用 Clojure S 表达式语法进行 Datalog。在下面的示例中,我们使用 Prolog 语法,它更容易阅读,但这在功能上没有区别。

viii Datomic and Cascalog use a Clojure S-expression syntax for Datalog. In the following examples we use a Prolog syntax, which is a little easier to read, but this makes no functional difference.

参考文献

[ 1 ] Edgar F. Codd:“大型共享数据库的数据关系模型”,ACM 通讯,第 13 卷,第 6 期,第 377-387 页,1970 年 6 月 。doi:10.1145/362384.362685

[1] Edgar F. Codd: “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, volume 13, number 6, pages 377–387, June 1970. doi:10.1145/362384.362685

[ 2 ] Michael Stonebraker 和 Joseph M. Hellerstein:“What Goes Around Comes Around”,数据库系统读物,第 4 版,麻省理工学院出版社,第 2-41 页,2005 年。ISBN:978-0-262-69314-1

[2] Michael Stonebraker and Joseph M. Hellerstein: “What Goes Around Comes Around,” in Readings in Database Systems, 4th edition, MIT Press, pages 2–41, 2005. ISBN: 978-0-262-69314-1

[ 3 ] Pramod J. Sadalage 和 Martin Fowler:NoSQL Distilled。艾迪生·韦斯利,2012 年 8 月。ISBN:978-0-321-82662-6

[3] Pramod J. Sadalage and Martin Fowler: NoSQL Distilled. Addison-Wesley, August 2012. ISBN: 978-0-321-82662-6

[ 4 ] Eric Evans:“ NoSQL:名字有什么含义?”,blog.sym-link.com,2009 年 10 月 30 日。

[4] Eric Evans: “NoSQL: What’s in a Name?,” blog.sym-link.com, October 30, 2009.

[ 5 ] James Phillips:“我们的 NoSQL 采用调查中的惊喜”,blog.couchbase.com,2012 年 2 月 8 日。

[5] James Phillips: “Surprises in Our NoSQL Adoption Survey,” blog.couchbase.com, February 8, 2012.

[ 6 ] Michael Wagner: SQL/XML:2006 – 数据银行系统标准规范评估。文凭出版社,汉堡,2010。ISBN:978-3-836-64609-3

[6] Michael Wagner: SQL/XML:2006 – Evaluierung der Standardkonformität ausgewählter Datenbanksysteme. Diplomica Verlag, Hamburg, 2010. ISBN: 978-3-836-64609-3

[ 7 ]“ SQL Server 中的 XML 数据”,SQL Server 2012 文档,technet.microsoft.com,2013 年。

[7] “XML Data in SQL Server,” SQL Server 2012 documentation, technet.microsoft.com, 2013.

[ 8 ]“ PostgreSQL 9.3.1 文档”,PostgreSQL 全球开发小组,2013 年。

[8] “PostgreSQL 9.3.1 Documentation,” The PostgreSQL Global Development Group, 2013.

[ 9 ]“ MongoDB 2.4 手册”,MongoDB, Inc.,2013 年。

[9] “The MongoDB 2.4 Manual,” MongoDB, Inc., 2013.

[ 10 ]“ RethinkDB 1.11 文档”,rethinkdb.com,2013 年。

[10] “RethinkDB 1.11 Documentation,” rethinkdb.com, 2013.

[ 11 ]“ Apache CouchDB 1.6 文档”,docs.couchdb.org,2014 年。

[11] “Apache CouchDB 1.6 Documentation,” docs.couchdb.org, 2014.

[ 12 ] Lin Qiao、Kapil Surlaker、Shirshanka Das 等人:“ On Brewing Fresh Espresso:LinkedIn 的分布式数据服务平台”,ACM 国际数据管理会议(SIGMOD),2013 年 6 月。

[12] Lin Qiao, Kapil Surlaker, Shirshanka Das, et al.: “On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform,” at ACM International Conference on Management of Data (SIGMOD), June 2013.

[ 13 ] Rick Long、Mark Harrington、Robert Hain 和 Geoff Nicholls: IMS 入门。IBM 红皮书 SG24-5352-00,IBM 国际技术支持组织,2000 年 1 月。

[13] Rick Long, Mark Harrington, Robert Hain, and Geoff Nicholls: IMS Primer. IBM Redbook SG24-5352-00, IBM International Technical Support Organization, January 2000.

[ 14 ] Stephen D. Bartlett:“ IBM 的 IMS — 神话、现实和机遇”,The Clipper Group Navigator,TCG2013015LI,2013 年 7 月。

[14] Stephen D. Bartlett: “IBM’s IMS—Myths, Realities, and Opportunities,” The Clipper Group Navigator, TCG2013015LI, July 2013.

[ 15 ] Sarah Mei:“为什么你永远不应该使用 MongoDB ”, sarahmei.com,2013 年 11 月 11 日。

[15] Sarah Mei: “Why You Should Never Use MongoDB,” sarahmei.com, November 11, 2013.

[ 16 ] JS Knowles 和 DMR Bell:“CODASYL 模型”,《数据库——角色和结构:高级课程》,由 PM Stocker、PMD Gray 和 MP Atkinson 编辑,第 19-56 页,剑桥大学出版社,1984 年。ISBN :978-0-521-25430-4

[16] J. S. Knowles and D. M. R. Bell: “The CODASYL Model,” in Databases—Role and Structure: An Advanced Course, edited by P. M. Stocker, P. M. D. Gray, and M. P. Atkinson, pages 19–56, Cambridge University Press, 1984. ISBN: 978-0-521-25430-4

[ 17 ] Charles W. Bachman:“ The Programmer as Navigator ”, Communications of the ACM,第 16 卷,第 11 期,第 653–658 页,1973 年 11 月 。doi:10.1145/355611.362534

[17] Charles W. Bachman: “The Programmer as Navigator,” Communications of the ACM, volume 16, number 11, pages 653–658, November 1973. doi:10.1145/355611.362534

[ 18 ] Joseph M. Hellerstein、Michael Stonebraker 和 James Hamilton:“数据库系统的架构”, 数据库基础与趋势,第 1 卷,第 2 期,第 141-259 页,2007 年 11 月 。doi:10.1561/1900000002

[18] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: “Architecture of a Database System,” Foundations and Trends in Databases, volume 1, number 2, pages 141–259, November 2007. doi:10.1561/1900000002

[ 19 ] Sandeep Parikh 和 Kelly Stirman:“ MongoDB 中时间序列数据的架构设计”,blog.mongodb.org,2013 年 10 月 30 日。

[19] Sandeep Parikh and Kelly Stirman: “Schema Design for Time Series Data in MongoDB,” blog.mongodb.org, October 30, 2013.

[ 20 ] Martin Fowler:“无模式数据结构”, martinfowler.com,2013 年 1 月 7 日。

[20] Martin Fowler: “Schemaless Data Structures,” martinfowler.com, January 7, 2013.

[ 21 ] Amr Awadallah:“读模式与写模式”,伯克利 EECS RAD 实验室静修会,加利福尼亚州圣克鲁斯,2009 年 5 月。

[21] Amr Awadallah: “Schema-on-Read vs. Schema-on-Write,” at Berkeley EECS RAD Lab Retreat, Santa Cruz, CA, May 2009.

[ 22 ] Martin Odersky:“类型的麻烦”,Strange Loop,2013 年 9 月。

[22] Martin Odersky: “The Trouble with Types,” at Strange Loop, September 2013.

[ 23 ] Conrad Irwin:“ MongoDB — PostgreSQL 爱好者的自白”,HTML5DevConf,2013 年 10 月。

[23] Conrad Irwin: “MongoDB—Confessions of a PostgreSQL Lover,” at HTML5DevConf, October 2013.

[ 24 ]“ Percona 工具包文档:pt-online-schema-change ”,Percona Ireland Ltd.,2013 年。

[24] “Percona Toolkit Documentation: pt-online-schema-change,” Percona Ireland Ltd., 2013.

[ 25 ]Rany Keddo、Tobias Bielohlawek 和 Tobias Schmidt:“大型强子迁移器”,SoundCloud,2013 年。

[25] Rany Keddo, Tobias Bielohlawek, and Tobias Schmidt: “Large Hadron Migrator,” SoundCloud, 2013.

[ 26 ] Shlomi Noach:“ gh-ost:GitHub 的 MySQL 在线架构迁移工具”,githubengineering.com,2016 年 8 月 1 日。

[26] Shlomi Noach: “gh-ost: GitHub’s Online Schema Migration Tool for MySQL,” githubengineering.com, August 1, 2016.

[ 27 ] James C. Corbett、Jeffrey Dean、Michael Epstein 等人:“ Spanner:Google 的全球分布式数据库”,第 10 届 USENIX 操作系统设计与实现(OSDI) 研讨会,2012 年 10 月。

[27] James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: “Spanner: Google’s Globally-Distributed Database,” at 10th USENIX Symposium on Operating System Design and Implementation (OSDI), October 2012.

[ 28 ] Donald K. Burleson:“使用 Oracle 集群表减少 I/O ”,dba-oracle.com

[28] Donald K. Burleson: “Reduce I/O with Oracle Cluster Tables,” dba-oracle.com.

[ 29 ] Fay Chang、Jeffrey Dean、Sanjay Ghemawat 等人:“ Bigtable:结构化数据的分布式存储系统”,第 7 届 USENIX 操作系统设计与实现(OSDI) 研讨会,2006 年 11 月。

[29] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: “Bigtable: A Distributed Storage System for Structured Data,” at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.

[ 30 ] Bobbie J. Cochrane 和 Kathy A. McKnight:“ DB2 JSON 功能,第 1 部分:DB2 JSON 简介”,IBM DeveloperWorks,2013 年 6 月 20 日。

[30] Bobbie J. Cochrane and Kathy A. McKnight: “DB2 JSON Capabilities, Part 1: Introduction to DB2 JSON,” IBM developerWorks, June 20, 2013.

[ 31 ] Herb Sutter:“免费午餐已经结束:软件并发的根本转变”,Dr. Dobb's Journal,第 30 卷,第 3 期,第 202-210 页,2005 年 3 月。

[31] Herb Sutter: “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software,” Dr. Dobb’s Journal, volume 30, number 3, pages 202-210, March 2005.

[ 32 ] Joseph M. Hellerstein:“声明式命令:分布式逻辑中的经验和猜想”,电气工程和计算机科学,加州大学伯克利分校,技术报告 UCB/EECS-2010-90,2010 年 6 月。

[32] Joseph M. Hellerstein: “The Declarative Imperative: Experiences and Conjectures in Distributed Logic,” Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech report UCB/EECS-2010-90, June 2010.

[ 33 ] Jeffrey Dean 和 Sanjay Ghemawat:“ MapReduce:大型集群上的简化数据处理”,第六届 USENIX 操作系统设计和实现(OSDI) 研讨会,2004 年 12 月。

[33] Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters,” at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.

[ 34 ] Craig Kerstiens:“ Postgres 中的 JavaScript ”, blog.heroku.com,2013 年 6 月 5 日。

[34] Craig Kerstiens: “JavaScript in Your Postgres,” blog.heroku.com, June 5, 2013.

[ 35 ] Nathan Bronson、Zach Amsden、George Cabrera 等人:“ TAO:Facebook 的社交图谱分布式数据存储”, USENIX 年度技术会议(USENIX ATC),2013 年 6 月。

[35] Nathan Bronson, Zach Amsden, George Cabrera, et al.: “TAO: Facebook’s Distributed Data Store for the Social Graph,” at USENIX Annual Technical Conference (USENIX ATC), June 2013.

[ 36 ]“ Apache TinkerPop3.2.3 文档”,tinkerpop.apache.org,2016 年 10 月。

[36] “Apache TinkerPop3.2.3 Documentation,” tinkerpop.apache.org, October 2016.

[ 37 ]“ Neo4j 手册 v2.0.0 ”,Neo 技术,2013 年。

[37] “The Neo4j Manual v2.0.0,” Neo Technology, 2013.

[ 38 ] Emil Eifrem: Twitter 通讯,2014 年 1 月 3 日。

[38] Emil Eifrem: Twitter correspondence, January 3, 2014.

[ 39 ] David Beckett 和 Tim Berners-Lee:“ Turtle – Terse RDF Triple Language ”,W3C 团队提交,2011 年 3 月 28 日。

[39] David Beckett and Tim Berners-Lee: “Turtle – Terse RDF Triple Language,” W3C Team Submission, March 28, 2011.

[ 40 ]“ Datomic 开发资源”,Metadata Partners, LLC,2013 年。

[40] “Datomic Development Resources,” Metadata Partners, LLC, 2013.

[ 41 ] W3C RDF 工作组:“资源描述框架 (RDF) ”, w3.org,2004 年 2 月 10 日。

[41] W3C RDF Working Group: “Resource Description Framework (RDF),” w3.org, 10 February 2004.

[ 42 ]“ Apache Jena ”,Apache 软件基金会。

[42] “Apache Jena,” Apache Software Foundation.

[ 43 ] Steve Harris、Andy Seaborne 和 Eric Prud'hommeaux:“ SPARQL 1.1 查询语言”,W3C 推荐,2013 年 3 月。

[43] Steve Harris, Andy Seaborne, and Eric Prud’hommeaux: “SPARQL 1.1 Query Language,” W3C Recommendation, March 2013.

[ 44 ] Todd J. Green、Shan Shan Huang、Boon Thau Loo 和 Wenchao Zhou:“Datalog 和递归查询处理”,数据库基础与趋势,第 5 卷,第 2 期,第 105-195 页,2013 年 11 月。doi:10.1561/1900000017

[44] Todd J. Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou: “Datalog and Recursive Query Processing,” Foundations and Trends in Databases, volume 5, number 2, pages 105–195, November 2013. doi:10.1561/1900000017

[ 45 ] Stefano Ceri、Georg Gottlob 和 Letizia Tanca:“关于 Datalog 你一直想了解的内容(但从来不敢问) ”,IEEE Transactions on Knowledge and Data Engineering,第 1 卷,第 1 期,第 146-166 页, 1989 年 3 月 。doi:10.1109/69.43410

[45] Stefano Ceri, Georg Gottlob, and Letizia Tanca: “What You Always Wanted to Know About Datalog (And Never Dared to Ask),” IEEE Transactions on Knowledge and Data Engineering, volume 1, number 1, pages 146–166, March 1989. doi:10.1109/69.43410

[ 46 ] Serge Abiteboul、Richard Hull 和 Victor Vianu: 数据库基础。Addison-Wesley,1995。ISBN:978-0-201-53771-0,可在线访问webdam.inria.fr/Alice

[46] Serge Abiteboul, Richard Hull, and Victor Vianu: Foundations of Databases. Addison-Wesley, 1995. ISBN: 978-0-201-53771-0, available online at webdam.inria.fr/Alice

[ 47 ] 内森·马兹:“ Cascalog ”,cascalog.org

[47] Nathan Marz: “Cascalog,” cascalog.org.

[ 48 ] Dennis A. Benson、Ilene Karsch-Mizrachi、David J. Lipman 等人:“ GenBank ”,《 核酸研究》,第 36 卷,数据库问题,D25–D30 页,2007 年 12 月 。doi:10.1093/nar/ gkm929

[48] Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, et al.: “GenBank,” Nucleic Acids Research, volume 36, Database issue, pages D25–D30, December 2007. doi:10.1093/nar/gkm929

[ 49 ] Fons Rademakers:“用于大数据分析的 ROOT”,大数据管理未来研讨会,英国伦敦,2013 年 6 月。

[49] Fons Rademakers: “ROOT for Big Data Analysis,” at Workshop on the Future of Big Data Management, London, UK, June 2013.

第 3 章 存储和检索

Chapter 3. Storage and Retrieval

Wer Ordnung hält, ist nur zu faul zum Suchen.

(如果你把东西整理得井井有条,你就懒得去寻找。)

德国谚语

Wer Ordnung hält, ist nur zu faul zum Suchen.

(If you keep things tidily ordered, you’re just too lazy to go searching.)

German proverb

在最基本的层面上,数据库需要做两件事:当你给它一些数据时,它应该存储数据,当你稍后再次请求它时,它应该将数据返回给你。

On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask it again later, it should give the data back to you.

第 2 章中,我们讨论了数据模型和查询语言,即您(应用程序开发人员)向数据库提供数据的格式,以及您稍后可以再次请求数据的机制。在本章中,我们从数据库的角度讨论同样的问题:如何存储给定的数据,以及如何在需要时再次找到它。

In Chapter 2 we discussed data models and query languages—i.e., the format in which you (the application developer) give the database your data, and the mechanism by which you can ask for it again later. In this chapter we discuss the same from the database’s point of view: how we can store the data that we’re given, and how we can find it again when we’re asked for it.

作为应用程序开发人员,为什么要关心数据库如何在内部处理存储和检索?您可能不会从头开始实现自己的存储引擎,但您 确实需要从众多可用的存储引擎中选择适合您的应用程序的存储引擎。为了调整存储引擎以在您的工作负载类型上表现良好,您需要大致了解存储引擎在幕后正在做什么。

Why should you, as an application developer, care how the database handles storage and retrieval internally? You’re probably not going to implement your own storage engine from scratch, but you do need to select a storage engine that is appropriate for your application, from the many that are available. In order to tune a storage engine to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood.

特别是,针对事务工作负载优化的存储引擎和针对分析优化的存储引擎之间存在很大差异。我们将在稍后的“事务处理还是分析?”中探讨这种区别,并在“面向列的存储”中讨论一系列针对分析进行优化的存储引擎。

In particular, there is a big difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics. We will explore that distinction later in “Transaction Processing or Analytics?”, and in “Column-Oriented Storage” we’ll discuss a family of storage engines that is optimized for analytics.

然而,首先我们将通过讨论您可能熟悉的数据库类型中使用的存储引擎来开始本章:传统关系数据库,以及大多数所谓的 NoSQL 数据库。我们将研究两个系列的存储引擎:日志结构存储引擎和面向页面的存储引擎(例如 B 树)。

However, first we’ll start this chapter by talking about storage engines that are used in the kinds of databases that you’re probably familiar with: traditional relational databases, and also most so-called NoSQL databases. We will examine two families of storage engines: log-structured storage engines, and page-oriented storage engines such as B-trees.

为数据库提供支持的数据结构

Data Structures That Power Your Database

考虑世界上最简单的数据库,由两个 Bash 函数实现:

Consider the world’s simplest database, implemented as two Bash functions:

#!/bin/bash

db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
#!/bin/bash

db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

这两个函数实现了一个键值存储。您可以调用 db_set key value,它会将 key 和 value 存储在数据库中。键和值可以是(几乎)您喜欢的任何内容,例如,值可以是一个 JSON 文档。然后您可以调用 db_get key,它会查找与该键关联的最新值并返回它。

These two functions implement a key-value store. You can call db_set key value, which will store key and value in the database. The key and value can be (almost) anything you like—for example, the value could be a JSON document. You can then call db_get key, which looks up the most recent value associated with that particular key and returns it.

它有效:

And it works:

$ db_set 123456 '{"name":"伦敦","景点":["大本钟","伦敦眼"]}'

$ db_set 42 '{"name":"旧金山","景点":["金门大桥"]}'

$ db_get 42
{"name":"旧金山","景点":["金门大桥"]}
$ db_set 123456 '{"name":"London","attractions":["Big Ben","London Eye"]}'

$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'

$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}

底层存储格式非常简单:一个文本文件,其中每行包含一个键值对,以逗号分隔(大致类似于 CSV 文件,忽略转义问题)。每次调用 db_set 都会追加到文件末尾,因此如果您多次更新某个键,该值的旧版本不会被覆盖,您需要查看文件中该键的最后一次出现来找到最新值(因此 db_get 中有 tail -n 1):

The underlying storage format is very simple: a text file where each line contains a key-value pair, separated by a comma (roughly like a CSV file, ignoring escaping issues). Every call to db_set appends to the end of the file, so if you update a key several times, the old versions of the value are not overwritten—you need to look at the last occurrence of a key in a file to find the latest value (hence the tail -n 1 in db_get):

$ db_set 42 '{"name":"旧金山","景点":["探索博物馆"]}'

$ db_get 42
{"name":"旧金山","景点":["探索博物馆"]}

$ cat database
123456,{"name":"伦敦","景点":["大本钟","伦敦眼"]}
42,{"name":"旧金山","景点":["金门大桥"]}
42,{"name":"旧金山","景点":["探索博物馆"]}
$ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}'

$ db_get 42
{"name":"San Francisco","attractions":["Exploratorium"]}

$ cat database
123456,{"name":"London","attractions":["Big Ben","London Eye"]}
42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
42,{"name":"San Francisco","attractions":["Exploratorium"]}

对于如此简单的实现,我们的 db_set 函数实际上具有相当好的性能,因为追加到文件末尾通常非常高效。与 db_set 的做法类似,许多数据库内部使用日志(log),即一个仅追加的数据文件。真正的数据库有更多的问题需要处理(例如并发控制、回收磁盘空间以使日志不会无限增长,以及处理错误和只写了一半的记录),但基本原理是相同的。日志非常有用,我们将在本书的其余部分多次遇到它们。

Our db_set function actually has pretty good performance for something that is so simple, because appending to a file is generally very efficient. Similarly to what db_set does, many databases internally use a log, which is an append-only data file. Real databases have more issues to deal with (such as concurrency control, reclaiming disk space so that the log doesn’t grow forever, and handling errors and partially written records), but the basic principle is the same. Logs are incredibly useful, and we will encounter them several times in the rest of this book.

注意

日志一词通常用于指应用程序日志,其中应用程序输出描述正在发生的事情的文本。在本书中,日志具有更一般的含义:仅附加的记录序列。它不必是人类可读的;它可能是二进制的,仅供其他程序读取。

The word log is often used to refer to application logs, where an application outputs text that describes what’s happening. In this book, log is used in the more general sense: an append-only sequence of records. It doesn’t have to be human-readable; it might be binary and intended only for other programs to read.

另一方面,如果数据库中有大量记录,我们的 db_get 函数的性能就会很差。每次想要查找某个键时,db_get 都必须从头到尾扫描整个数据库文件,查找该键出现的位置。用算法术语来说,查找的成本是 O(n):如果将数据库中的记录数 n 加倍,查找时间就会加倍。这可不好。

On the other hand, our db_get function has terrible performance if you have a large number of records in your database. Every time you want to look up a key, db_get has to scan the entire database file from beginning to end, looking for occurrences of the key. In algorithmic terms, the cost of a lookup is O(n): if you double the number of records n in your database, a lookup takes twice as long. That’s not good.

为了有效地查找数据库中特定键的值,我们需要不同的数据结构:索引。在本章中,我们将研究一系列索引结构并了解它们的比较;它们背后的总体想法是在侧面保留一些额外的元数据,这些元数据充当路标并帮助您找到所需的数据。如果您想以多种不同的方式搜索相同的数据,则可能需要在数据的不同部分上使用多个不同的索引。

In order to efficiently find the value for a particular key in the database, we need a different data structure: an index. In this chapter we will look at a range of indexing structures and see how they compare; the general idea behind them is to keep some additional metadata on the side, which acts as a signpost and helps you to locate the data you want. If you want to search the same data in several different ways, you may need several different indexes on different parts of the data.

索引是从主要数据派生的附加结构。许多数据库允许添加和删除索引,这不会影响数据库的内容;它只影响查询的性能。维护额外的结构会产生开销,尤其是在写入方面。对于写入,很难超越简单地追加到文件的性能,因为这是最简单的写入操作。任何类型的索引通常都会减慢写入速度,因为每次写入数据时索引也需要更新。

An index is an additional structure that is derived from the primary data. Many databases allow you to add and remove indexes, and this doesn’t affect the contents of the database; it only affects the performance of queries. Maintaining additional structures incurs overhead, especially on writes. For writes, it’s hard to beat the performance of simply appending to a file, because that’s the simplest possible write operation. Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.

这是存储系统中的一个重要权衡:精心选择的索引可以加速读取查询,但每个索引都会减慢写入速度。因此,默认情况下数据库通常不会对所有内容建立索引,而是要求您(应用程序开发人员或数据库管理员)使用您对应用程序典型查询模式的了解来手动选择索引。然后,您可以选择为您的应用程序带来最大好处的索引,而不会引入不必要的开销。

This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes. For this reason, databases don’t usually index everything by default, but require you—the application developer or database administrator—to choose indexes manually, using your knowledge of the application’s typical query patterns. You can then choose the indexes that give your application the greatest benefit, without introducing more overhead than necessary.

哈希索引

Hash Indexes

让我们从键值数据的索引开始。这不是唯一可以索引的数据类型,但它很常见,并且是更复杂索引的有用构建块。

Let’s start with indexes for key-value data. This is not the only kind of data you can index, but it’s very common, and it’s a useful building block for more complex indexes.

键值存储与大多数编程语言中的字典类型非常相似,字典通常以哈希映射(哈希表)的形式实现。哈希映射在许多算法教科书 [ 1 , 2 ] 中都有描述,因此我们在这里不详细介绍它们的工作原理。既然我们已经有了用于内存数据结构的哈希映射,为什么不用它们来索引磁盘上的数据呢?

Key-value stores are quite similar to the dictionary type that you can find in most programming languages, and which is usually implemented as a hash map (hash table). Hash maps are described in many algorithms textbooks [1, 2], so we won’t go into detail of how they work here. Since we already have hash maps for our in-memory data structures, why not use them to index our data on disk?

假设我们的数据存储仅包含附加到文件,如前面的示例所示。那么最简单的索引策略就是:在内存中保留一个哈希映射,其中每个键都映射到数据文件中的字节偏移量(可以找到该值的位置),如图 3-1所示。每当您将新的键值对附加到文件时,您还会更新哈希映射以反映刚刚写入的数据的偏移量(这既适用于插入新键,也适用于更新现有键)。当您想要查找某个值时,请使用哈希映射查找数据文件中的偏移量,查找该位置并读取该值。

Let’s say our data storage consists only of appending to a file, as in the preceding example. Then the simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file—the location at which the value can be found, as illustrated in Figure 3-1. Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote (this works both for inserting new keys and for updating existing keys). When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.

图 3-1。以类似 CSV 的格式存储键值对日志,并使用内存中的哈希映射进行索引。
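The scheme in Figure 3-1 can be sketched in a few lines of Python. This is a toy illustration under the same assumptions as the figure (a comma-separated append-only log, an in-memory dict of byte offsets) — not Bitcask's actual implementation, and all names here are made up:

```python
import os
import tempfile

class LogKV:
    """Append-only log file plus an in-memory hash index of byte offsets."""

    def __init__(self, path):
        self.path = path
        self.index = {}              # key -> byte offset of its latest record
        open(path, "ab").close()     # create the log file if needed

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.tell()        # where this record starts
            f.write(record)
        self.index[key] = offset     # works for both inserts and updates

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)           # one seek straight to the record
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1] # strip the "key," prefix

db = LogKV(os.path.join(tempfile.mkdtemp(), "database"))
db.set("42", '{"name":"San Francisco"}')
db.set("42", '{"name":"Exploratorium"}')
print(db.get("42"))  # the most recent value for key 42
```

Note that a write updates the index to point at the newest record, so reads never see stale values even though old records remain in the file.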

这听起来可能过于简单,但这是一种可行的方法。事实上,这本质上就是 Bitcask(Riak 中的默认存储引擎)所做的事情 [ 3 ]。Bitcask 提供高性能的读写,但要求所有键都能放入可用的 RAM 中,因为哈希映射完全保存在内存中。值可以使用比可用内存更多的空间,因为只需一次磁盘寻道即可从磁盘加载它们。如果数据文件的该部分已经在文件系统缓存中,则读取根本不需要任何磁盘 I/O。

This may sound simplistic, but it is a viable approach. In fact, this is essentially what Bitcask (the default storage engine in Riak) does [3]. Bitcask offers high-performance reads and writes, subject to the requirement that all the keys fit in the available RAM, since the hash map is kept completely in memory. The values can use more space than there is available memory, since they can be loaded from disk with just one disk seek. If that part of the data file is already in the filesystem cache, a read doesn’t require any disk I/O at all.

像 Bitcask 这样的存储引擎非常适合每个键的值频繁更新的情况。例如,键可能是猫视频的 URL,值可能是播放次数(每次有人点击播放按钮时都会增加)。在这种工作负载中,有大量写入,但没有太多不同的键 - 每个键都有大量写入,但将所有键保留在内存中是可行的。

A storage engine like Bitcask is well suited to situations where the value for each key is updated frequently. For example, the key might be the URL of a cat video, and the value might be the number of times it has been played (incremented every time someone hits the play button). In this kind of workload, there are a lot of writes, but there are not too many distinct keys—you have a large number of writes per key, but it’s feasible to keep all keys in memory.

按照目前的描述,我们只会向文件追加内容——那么如何避免最终耗尽磁盘空间呢?一个好的解决方案是将日志分成特定大小的段:当段文件达到一定大小时将其关闭,并把后续写入追加到新的段文件中。然后我们可以对这些段进行压缩(compaction),如图 3-2 所示。压缩意味着丢弃日志中重复的键,只保留每个键的最新更新。

As described so far, we only ever append to a file—so how do we avoid eventually running out of disk space? A good solution is to break the log into segments of a certain size by closing a segment file when it reaches a certain size, and making subsequent writes to a new segment file. We can then perform compaction on these segments, as illustrated in Figure 3-2. Compaction means throwing away duplicate keys in the log, and keeping only the most recent update for each key.

图 3-2。压缩键值更新日志(计算每个猫视频的播放次数),仅保留每个键的最新值。

Figure 3-2. Compaction of a key-value update log (counting the number of times each cat video was played), retaining only the most recent value for each key.
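Compaction itself is conceptually tiny: replay a segment in write order and let later values for a key overwrite earlier ones. A minimal sketch, with a segment represented as a list of key-value pairs:

```python
def compact(segment):
    """Keep only the most recent value for each key, where `segment`
    is a list of (key, value) pairs in write order."""
    latest = {}
    for key, value in segment:  # later writes overwrite earlier ones
        latest[key] = value
    return list(latest.items())
```

For the cat-video workload above, `compact([("mew", 1078), ("purr", 2103), ("purr", 2104)])` keeps only the final play count for each video.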

此外,由于压缩通常会使段变得更小(假设一个键在一个段内平均被覆盖多次),因此我们还可以在执行压缩的同时将多个段合并在一起,如图3-3所示。段在写入后就不会被修改,因此合并后的段将写入新文件。冻结段的合并和压缩可以在后台线程中完成,并且在进行过程中,我们仍然可以使用旧的段文件继续正常服务读写请求。合并过程完成后,我们将读取请求切换为使用新的合并段而不是旧段,然后可以简单地删除旧段文件。

Moreover, since compaction often makes segments much smaller (assuming that a key is overwritten several times on average within one segment), we can also merge several segments together at the same time as performing the compaction, as shown in Figure 3-3. Segments are never modified after they have been written, so the merged segment is written to a new file. The merging and compaction of frozen segments can be done in a background thread, and while it is going on, we can still continue to serve read and write requests as normal, using the old segment files. After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments—and then the old segment files can simply be deleted.

图 3-3。同时执行压缩和段合并。

Figure 3-3. Performing compaction and segment merging at the same time.

现在,每个段都有自己的内存哈希表,将键映射到文件偏移量。为了找到某个键的值,我们首先检查最新段的哈希映射;如果键不存在,就检查第二新的段,依此类推。合并过程使段的数量保持在较少水平,因此查找不需要检查很多哈希映射。

Each segment now has its own in-memory hash table, mapping keys to file offsets. In order to find the value for a key, we first check the most recent segment’s hash map; if the key is not present we check the second-most-recent segment, and so on. The merging process keeps the number of segments small, so lookups don’t need to check many hash maps.

要使这个简单的想法付诸实践需要很多细节。简而言之,实际实施中一些重要的问题是:

Lots of detail goes into making this simple idea work in practice. Briefly, some of the issues that are important in a real implementation are:

文件格式
File format

CSV 不是日志的最佳格式。使用二进制格式更快更简单,首先以字节为单位编码字符串的长度,然后是原始字符串(无需转义)。

CSV is not the best format for a log. It’s faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string (without need for escaping).

删除记录
Deleting records

如果要删除某个键及其关联的值,则必须向数据文件追加一条特殊的删除记录(有时称为墓碑,tombstone)。合并日志段时,墓碑会告诉合并过程丢弃被删除键的所有先前值。

If you want to delete a key and its associated value, you have to append a special deletion record to the data file (sometimes called a tombstone). When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.

崩溃恢复
Crash recovery

如果数据库重新启动,内存中的哈希映射就会丢失。原则上,您可以从头到尾读取整个段文件,并顺带记下每个键最新值的偏移量,从而恢复每个段的哈希映射。但是,如果段文件很大,这可能需要很长时间,会让服务器重启变得痛苦。Bitcask 通过在磁盘上存储每个段哈希映射的快照来加速恢复,快照可以更快地加载到内存中。

If the database is restarted, the in-memory hash maps are lost. In principle, you can restore each segment’s hash map by reading the entire segment file from beginning to end and noting the offset of the most recent value for every key as you go along. However, that might take a long time if the segment files are large, which would make server restarts painful. Bitcask speeds up recovery by storing a snapshot of each segment’s hash map on disk, which can be loaded into memory more quickly.

部分书面记录
Partially written records

数据库可能随时崩溃,包括在将记录追加到日志的过程中。Bitcask 文件包含校验和,允许检测并忽略日志中此类损坏的部分。

The database may crash at any time, including halfway through appending a record to the log. Bitcask files include checksums, allowing such corrupted parts of the log to be detected and ignored.

并发控制
Concurrency control

由于写入操作是按照严格的顺序附加到日志中的,因此一种常见的实现选择是只有一个写入器线程。数据文件段是仅追加的且不可变的,因此它们可以由多个线程同时读取。

As writes are appended to the log in a strictly sequential order, a common implementation choice is to have only one writer thread. Data file segments are append-only and otherwise immutable, so they can be read concurrently by multiple threads.

乍一看,仅附加日志似乎很浪费:为什么不就地更新文件,用新值覆盖旧值?但事实证明,仅附加设计是好的,原因如下:

An append-only log seems wasteful at first glance: why don’t you update the file in place, overwriting the old value with the new value? But an append-only design turns out to be good for several reasons:

  • 追加和段合并都是顺序写入操作,通常比随机写入快得多,尤其是在磁性机械硬盘上。在一定程度上,顺序写入在基于闪存的固态硬盘(SSD)上也更可取 [ 4 ]。我们将在“比较 B 树和 LSM 树”中进一步讨论这个问题。

  • Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on magnetic spinning-disk hard drives. To some extent sequential writes are also preferable on flash-based solid state drives (SSDs) [4]. We will discuss this issue further in “Comparing B-Trees and LSM-Trees”.

  • 如果段文件是仅追加的或不可变的,那么并发和崩溃恢复就会简单得多。例如,您不必担心在覆盖值时发生崩溃的情况,从而留下一个包含部分旧值和部分新值拼接在一起的文件。

  • Concurrency and crash recovery are much simpler if segment files are append-only or immutable. For example, you don’t have to worry about the case where a crash happened while a value was being overwritten, leaving you with a file containing part of the old and part of the new value spliced together.

  • 合并旧段可以避免数据文件随着时间的推移而变得碎片化的问题。

  • Merging old segments avoids the problem of data files getting fragmented over time.

但是,哈希表索引也有局限性:

However, the hash table index also has limitations:

  • 哈希表必须适合内存,因此如果您有大量键,那么您就不走运了。原则上,您可以在磁盘上维护哈希映射,但不幸的是很难使磁盘上的哈希映射表现良好。它需要大量的随机访问 I/O,当它变满时增长的成本很高,并且哈希冲突需要复杂的逻辑 [ 5 ]。

  • The hash table must fit in memory, so if you have a very large number of keys, you’re out of luck. In principle, you could maintain a hash map on disk, but unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of random access I/O, it is expensive to grow when it becomes full, and hash collisions require fiddly logic [5].

  • 范围查询的效率不高。例如,您无法轻松地扫描 kitty00000 和 kitty99999 之间的所有键——您必须在哈希映射中逐个查找每个键。

  • Range queries are not efficient. For example, you cannot easily scan over all keys between kitty00000 and kitty99999—you’d have to look up each key individually in the hash maps.

在下一节中,我们将研究没有这些限制的索引结构。

In the next section we will look at an indexing structure that doesn’t have those limitations.

SSTable 和 LSM 树

SSTables and LSM-Trees

在图 3-3 中,每个日志结构存储段都是一个键值对序列。这些键值对按写入顺序出现,对同一个键而言,日志中靠后的值优先于靠前的值。除此之外,文件中键值对的顺序并不重要。

In Figure 3-3, each log-structured storage segment is a sequence of key-value pairs. These pairs appear in the order that they were written, and values later in the log take precedence over values for the same key earlier in the log. Apart from that, the order of key-value pairs in the file does not matter.

现在我们可以对段文件的格式做一个简单的更改:我们要求键值对的序列按键排序。乍一看,这个要求似乎破坏了我们使用顺序写入的能力,但我们稍后会谈到这一点。

Now we can make a simple change to the format of our segment files: we require that the sequence of key-value pairs is sorted by key. At first glance, that requirement seems to break our ability to use sequential writes, but we’ll get to that in a moment.

我们将这种格式称为“排序字符串表”,简称SSTable 。我们还要求每个键在每个合并的段文件中仅出现一次(压缩过程已经确保了这一点)。与具有哈希索引的日志段相比,SSTable 具有几大优势:

We call this format Sorted String Table, or SSTable for short. We also require that each key only appears once within each merged segment file (the compaction process already ensures that). SSTables have several big advantages over log segments with hash indexes:

  1. 即使文件大于可用内存,合并段也是简单而高效的。该方法类似于归并排序算法中使用的方法,如图 3-4所示 :开始并排读取输入文件,查看每个文件中的第一个键,复制最低的键(根据排序顺序)到输出文件,然后重复。这会生成一个新的合并段文件,也按键排序。

    图 3-4。合并多个 SSTable 段,仅保留每个键的最新值。

    如果同一个键出现在多个输入段中怎么办?请记住,每个段都包含在某个时间段内写入数据库的所有值。这意味着一个输入段中的所有值必须比另一段中的所有值更新(假设我们始终合并相邻段)。当多个段包含相同的键时,我们可以保留最近段中的值并丢弃旧段中的值。

  2. Merging segments is simple and efficient, even if the files are bigger than the available memory. The approach is like the one used in the mergesort algorithm and is illustrated in Figure 3-4: you start reading the input files side by side, look at the first key in each file, copy the lowest key (according to the sort order) to the output file, and repeat. This produces a new merged segment file, also sorted by key.

    Figure 3-4. Merging several SSTable segments, retaining only the most recent value for each key.

    What if the same key appears in several input segments? Remember that each segment contains all the values written to the database during some period of time. This means that all the values in one input segment must be more recent than all the values in the other segment (assuming that we always merge adjacent segments). When multiple segments contain the same key, we can keep the value from the most recent segment and discard the values in older segments.

  3. 为了在文件中查找特定的键,您不再需要在内存中保留所有键的索引。参见图 3-5 的示例:假设您正在查找键 handiwork,但不知道该键在段文件中的确切偏移量。不过,您确实知道键 handbag 和 handsome 的偏移量,而由于键是排序的,您知道 handiwork 必定出现在这两者之间。这意味着您可以跳到 handbag 的偏移量,从那里开始扫描,直到找到 handiwork(如果该键不在文件中,则扫描不到)。

    图 3-5。具有内存索引的 SSTable。

    您仍然需要一个内存中索引来告诉您某些键的偏移量,但它可以是稀疏的:每几千字节的段文件一个键就足够了,因为可以很快扫描几千字节。

  4. In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory. See Figure 3-5 for an example: say you’re looking for the key handiwork, but you don’t know the exact offset of that key in the segment file. However, you do know the offsets for the keys handbag and handsome, and because of the sorting you know that handiwork must appear between those two. This means you can jump to the offset for handbag and scan from there until you find handiwork (or not, if the key is not present in the file).

    Figure 3-5. An SSTable with an in-memory index.

    You still need an in-memory index to tell you the offsets for some of the keys, but it can be sparse: one key for every few kilobytes of segment file is sufficient, because a few kilobytes can be scanned very quickly.

  5. 由于读取请求无论如何都需要扫描所请求范围内的多个键值对,因此可以将这些记录分组为一个块,并在写入磁盘之前对其进行压缩(如图 3-5 中的阴影区域所示)。然后,稀疏内存索引的每个条目都指向一个压缩块的开头。除了节省磁盘空间之外,压缩还减少了 I/O 带宽的使用。

  6. Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk (indicated by the shaded area in Figure 3-5). Each entry of the sparse in-memory index then points at the start of a compressed block. Besides saving disk space, compression also reduces the I/O bandwidth use.

构建和维护 SSTable

Constructing and maintaining SSTables

到目前为止还不错,但是首先如何让数据按键排序呢?我们传入的写入可以按任何顺序发生。

Fine so far—but how do you get your data to be sorted by key in the first place? Our incoming writes can occur in any order.

在磁盘上维护排序结构是可能的(请参阅“B 树”),但在内存中维护它要容易得多。您可以使用许多众所周知的树数据结构,例如红黑树或 AVL 树 [ 2 ]。使用这些数据结构,您可以按任意顺序插入键并按排序顺序读回它们。

Maintaining a sorted structure on disk is possible (see “B-Trees”), but maintaining it in memory is much easier. There are plenty of well-known tree data structures that you can use, such as red-black trees or AVL trees [2]. With these data structures, you can insert keys in any order and read them back in sorted order.

现在我们可以让我们的存储引擎按如下方式工作:

We can now make our storage engine work as follows:

  • 当写入进来时,将其添加到内存中的平衡树数据结构(例如红黑树)。这种内存树有时称为memtable

  • When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree). This in-memory tree is sometimes called a memtable.

  • 当内存表大于某个阈值(通常是几兆字节)时,将其作为 SSTable 文件写入磁盘。这可以高效地完成,因为树已经维护了按键排序的键值对。新的 SSTable 文件成为数据库的最新段。当 SSTable 被写入磁盘时,写入可以继续到新的 memtable 实例。

  • When the memtable gets bigger than some threshold—typically a few megabytes—write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.

  • 为了服务读取请求,首先尝试在内存表中查找键,然后在最近的磁盘段中查找键,然后在下一个较旧的段中查找键,依此类推。

  • In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.

  • 有时,在后台运行合并和压缩过程以合并段文件并丢弃覆盖或删除的值。

  • From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.

这个方案效果很好。它只存在一个问题:如果数据库崩溃,最近的写入(位于内存表中但尚未写入磁盘)就会丢失。为了避免这个问题,我们可以在磁盘上保留一个单独的日志,每次写入都会立即附加到该日志,就像上一节一样。该日志未按排序顺序排列,但这并不重要,因为它的唯一目的是在崩溃后恢复内存表。每次将memtable写入SSTable时,相应的日志都会被丢弃。

This scheme works very well. It only suffers from one problem: if the database crashes, the most recent writes (which are in the memtable but not yet written out to disk) are lost. In order to avoid that problem, we can keep a separate log on disk to which every write is immediately appended, just like in the previous section. That log is not in sorted order, but that doesn’t matter, because its only purpose is to restore the memtable after a crash. Every time the memtable is written out to an SSTable, the corresponding log can be discarded.
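Putting the steps above together, a toy write path might look like the following (a plain dict stands in for the balanced tree, and Python lists stand in for the on-disk WAL and segment files — an illustrative sketch, not how LevelDB or RocksDB is actually built):

```python
class TinyLSM:
    """Toy LSM write path: writes go to a memtable and a WAL; when the
    memtable reaches a size limit it is flushed as a sorted segment."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.wal = []             # stands in for the on-disk write-ahead log
        self.segments = []        # on-disk SSTables, oldest first
        self.memtable_limit = memtable_limit

    def set(self, key, value):
        self.wal.append((key, value))   # for crash recovery: replay this log
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # writing the memtable out in key order yields a sorted SSTable
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []             # the flushed writes are now durable

    def get(self, key):
        if key in self.memtable:             # check the newest data first
            return self.memtable[key]
        for seg in reversed(self.segments):  # then most recent segment, etc.
            for k, v in seg:
                if k == key:
                    return v
        return None
```

A real engine would binary-search each segment (they are sorted), run the merging/compaction step in the background, and replay the WAL on startup; the sketch only shows the data flow.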

用 SSTables 制作 LSM 树

Making an LSM-tree out of SSTables

这里描述的算法本质上是 LevelDB [ 6 ] 和 RocksDB [ 7 ] 中使用的算法,这两个键值存储引擎库旨在嵌入到其他应用程序中。除此之外,LevelDB 可以在 Riak 中用作 Bitcask 的替代品。Cassandra 和 HBase [ 8 ]中使用了类似的存储引擎,它们都受到 Google 的 Bigtable 论文 [ 9 ](引入了术语SSTablememtable)的启发。

The algorithm described here is essentially what is used in LevelDB [6] and RocksDB [7], key-value storage engine libraries that are designed to be embedded into other applications. Among other things, LevelDB can be used in Riak as an alternative to Bitcask. Similar storage engines are used in Cassandra and HBase [8], both of which were inspired by Google’s Bigtable paper [9] (which introduced the terms SSTable and memtable).

最初,这种索引结构由 Patrick O'Neil 等人以日志结构合并树(Log-Structured Merge-Tree,即 LSM 树)之名提出 [ 10 ],它建立在日志结构文件系统的早期工作基础上 [ 11 ]。基于这种合并和压缩排序文件原理的存储引擎通常称为 LSM 存储引擎。

Originally this indexing structure was described by Patrick O’Neil et al. under the name Log-Structured Merge-Tree (or LSM-Tree) [10], building on earlier work on log-structured filesystems [11]. Storage engines that are based on this principle of merging and compacting sorted files are often called LSM storage engines.

Lucene 是 Elasticsearch 和 Solr 所使用的全文搜索索引引擎,它使用类似的方法来存储其术语词典 [ 12 , 13 ]。全文索引比键值索引复杂得多,但基于类似的思想:给定搜索查询中的一个单词,找到提及该单词的所有文档(网页、产品描述等)。这是通过一个键值结构实现的,其中键是单词(术语,term),值是包含该单词的所有文档的 ID 列表(发布列表,postings list)。在 Lucene 中,从术语到发布列表的映射保存在类似 SSTable 的排序文件中,这些文件会根据需要在后台合并 [ 14 ]。

Lucene, an indexing engine for full-text search used by Elasticsearch and Solr, uses a similar method for storing its term dictionary [12, 13]. A full-text index is much more complex than a key-value index but is based on a similar idea: given a word in a search query, find all the documents (web pages, product descriptions, etc.) that mention the word. This is implemented with a key-value structure where the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list). In Lucene, this mapping from term to postings list is kept in SSTable-like sorted files, which are merged in the background as needed [14].

性能优化

Performance optimizations

与往常一样,要使存储引擎在实践中表现良好,需要考虑很多细节。例如,在查找数据库中不存在的键时,LSM 树算法可能会很慢:您必须先检查内存表,再把各个段一直检查到最旧的段(每个段都可能需要从磁盘读取),然后才能确定该键不存在。为了优化这种访问,存储引擎通常使用额外的布隆过滤器(Bloom filter)[ 15 ]。(布隆过滤器是一种节省内存的数据结构,用于近似表示集合的内容。它可以告诉您某个键是否不在数据库中,从而为不存在的键省去许多不必要的磁盘读取。)

As always, a lot of detail goes into making a storage engine perform well in practice. For example, the LSM-tree algorithm can be slow when looking up keys that do not exist in the database: you have to check the memtable, then the segments all the way back to the oldest (possibly having to read from disk for each one) before you can be sure that the key does not exist. In order to optimize this kind of access, storage engines often use additional Bloom filters [15]. (A Bloom filter is a memory-efficient data structure for approximating the contents of a set. It can tell you if a key does not appear in the database, and thus saves many unnecessary disk reads for nonexistent keys.)
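A Bloom filter can be sketched in a few lines (the bit-array size, hash count, and the use of BLAKE2 personalization to derive independent hash functions are arbitrary choices for illustration, not what any particular engine does):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over a bit array.
    might_contain() can return false positives but never false negatives,
    which is exactly what lets an LSM-tree skip disk reads for keys that
    are definitely absent."""

    def __init__(self, size_bits=1024, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: bytes):
        for i in range(self.k):
            # derive k independent hashes by personalizing BLAKE2b
            h = hashlib.blake2b(key, person=i.to_bytes(16, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

Before touching any segment, the engine asks the filter; only if the answer is "maybe present" does it pay for the memtable and segment lookups.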

还有不同的策略来决定 SSTable 压缩与合并的顺序和时机。最常见的选项是大小分层(size-tiered)压缩和分级(leveled)压缩。LevelDB 和 RocksDB 使用分级压缩(LevelDB 因此得名),HBase 使用大小分层压缩,而 Cassandra 两者都支持 [ 16 ]。在大小分层压缩中,较新、较小的 SSTable 会陆续合并到较旧、较大的 SSTable 中。在分级压缩中,键范围被拆分成较小的 SSTable,较旧的数据被移到单独的“层级”中,这使得压缩可以更加增量地进行,并使用更少的磁盘空间。

There are also different strategies to determine the order and timing of how SSTables are compacted and merged. The most common options are size-tiered and leveled compaction. LevelDB and RocksDB use leveled compaction (hence the name of LevelDB), HBase uses size-tiered, and Cassandra supports both [16]. In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables. In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space.

尽管存在许多微妙之处,但 LSM 树的基本思想(保持在后台合并的级联 SSTable)简单而有效。即使数据集比可用内存大得多,它仍然可以正常工作。由于数据按排序顺序存储,因此您可以有效地执行范围查询(扫描高于某个最小值和某个最大值的所有键),并且由于磁盘写入是连续的,因此 LSM 树可以支持非常高的写入吞吐量。

Even though there are many subtleties, the basic idea of LSM-trees—keeping a cascade of SSTables that are merged in the background—is simple and effective. Even when the dataset is much bigger than the available memory it continues to work well. Since data is stored in sorted order, you can efficiently perform range queries (scanning all keys above some minimum and up to some maximum), and because the disk writes are sequential the LSM-tree can support remarkably high write throughput.

B树

B-Trees

到目前为止,我们讨论的日志结构索引正在逐渐获得认可,但它们还不是最常见的索引类型。最广泛使用的索引结构与之截然不同:B 树。

The log-structured indexes we have discussed so far are gaining acceptance, but they are not the most common type of index. The most widely used indexing structure is quite different: the B-tree.

B 树于 1970 年问世 [ 17 ],不到十年后便被称为“无处不在” [ 18 ],它很好地经受住了时间的考验。它们至今仍是几乎所有关系数据库中的标准索引实现,许多非关系数据库也在使用它们。

Introduced in 1970 [17] and called “ubiquitous” less than 10 years later [18], B-trees have stood the test of time very well. They remain the standard index implementation in almost all relational databases, and many nonrelational databases use them too.

与 SSTable 一样,B 树保持按键排序的键值对,这允许高效的键值查找和范围查询。但相似之处仅此而已:B 树具有非常不同的设计理念。

Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries. But that’s where the similarity ends: B-trees have a very different design philosophy.

我们之前看到的日志结构索引将数据库分解为可变大小的段,通常为几兆字节或更大,并且总是按顺序写入段。相比之下,B 树将数据库分解为固定大小的块或页,传统上大小为 4 KB(有时更大),并且一次读取或写入一页。这种设计更贴近底层硬件,因为磁盘本身也是按固定大小的块来组织的。

The log-structured indexes we saw earlier break the database down into variable-size segments, typically several megabytes or more in size, and always write a segment sequentially. By contrast, B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.

每个页面都可以使用地址或位置来标识,这允许一个页面引用另一个页面——类似于指针,但在磁盘上而不是在内存中。我们可以使用这些页面引用来构建页面树,如图3-6所示。

Each page can be identified using an address or location, which allows one page to refer to another—similar to a pointer, but on disk instead of in memory. We can use these page references to construct a tree of pages, as illustrated in Figure 3-6.

图 3-6。使用 B 树索引查找键。

Figure 3-6. Looking up a key using a B-tree index.

将一页指定为B 树的根;每当您想在索引中查找某个键时,都可以从这里开始。该页面包含多个键和对子页面的引用。每个子项负责一个连续的键范围,并且引用之间的键指示这些范围之间的边界所在。

One page is designated as the root of the B-tree; whenever you want to look up a key in the index, you start here. The page contains several keys and references to child pages. Each child is responsible for a continuous range of keys, and the keys between the references indicate where the boundaries between those ranges lie.

在图 3-6 的示例中,我们要查找键 251,因此我们知道需要沿着边界 200 和 300 之间的页面引用前进。这会把我们带到一个结构类似的页面,该页面进一步将 200–300 的范围划分为若干子范围。最终我们到达一个包含单个键的页面(叶页面),它要么内联包含每个键的值,要么包含指向可以找到这些值的页面的引用。

In the example in Figure 3-6, we are looking for the key 251, so we know that we need to follow the page reference between the boundaries 200 and 300. That takes us to a similar-looking page that further breaks down the 200–300 range into subranges. Eventually we get down to a page containing individual keys (a leaf page), which either contains the value for each key inline or contains references to the pages where the values can be found.
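The descent described above can be sketched as follows (pages modeled as dicts purely for illustration; a real B-tree stores pages on disk, but the walk from boundary keys to child references is the same):

```python
import bisect

def btree_lookup(page, key):
    """Walk from the root to a leaf, as in Figure 3-6. Interior pages
    hold sorted boundary keys and child references; leaf pages hold
    the values inline."""
    while not page["leaf"]:
        # child i covers the keys between boundaries i-1 and i
        i = bisect.bisect_right(page["keys"], key)
        page = page["children"][i]
    return page["values"].get(key)
```

For the lookup of key 251, `bisect_right([200, 300], 251)` returns 1, so we follow the middle child reference, exactly as in the figure.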

B 树一页中对子页的引用数量称为分支因子。例如,在图 3-6中,分支因子为 6。实际上,分支因子取决于存储页面引用和范围边界所需的空间量,但通常为数百。

The number of references to child pages in one page of the B-tree is called the branching factor. For example, in Figure 3-6 the branching factor is six. In practice, the branching factor depends on the amount of space required to store the page references and the range boundaries, but typically it is several hundred.

如果要更新 B 树中现有键的值,需要搜索包含该键的叶页面,更改该页面中的值,然后将该页面写回磁盘(对该页面的所有引用仍然有效)。如果要添加新键,则需要找到范围包含新键的页面,并将其添加到该页面。如果页面中没有足够的可用空间容纳新键,则将其拆分为两个半满的页面,并更新父页面以反映键范围的新划分——参见图 3-7。

If you want to update the value for an existing key in a B-tree, you search for the leaf page containing that key, change the value in that page, and write the page back to disk (any references to that page remain valid). If you want to add a new key, you need to find the page whose range encompasses the new key and add it to that page. If there isn’t enough free space in the page to accommodate the new key, it is split into two half-full pages, and the parent page is updated to account for the new subdivision of key ranges—see Figure 3-7.

图 3-7。通过拆分页面来增长 B 树。

Figure 3-7. Growing a B-tree by splitting a page.

该算法确保树保持平衡:具有 n 个键的 B 树的深度始终为 O(log n)。大多数数据库都可以放进一棵三到四层深的 B 树中,因此您无需跟随很多页面引用就能找到要查找的页面。(分支因子为 500、页面大小为 4 KB 的四层树最多可存储 256 TB。)

This algorithm ensures that the tree remains balanced: a B-tree with n keys always has a depth of O(log n). Most databases can fit into a B-tree that is three or four levels deep, so you don’t need to follow many page references to find the page you are looking for. (A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.)
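That capacity figure checks out with quick arithmetic, counting four levels of 500-way fan-out and using decimal terabytes:

```python
branching_factor = 500
page_size = 4 * 1024          # 4 KB per page

# four levels of 500-way fan-out give 500**4 pages at the bottom level
bottom_pages = branching_factor ** 4
capacity_bytes = bottom_pages * page_size

# 62,500,000,000 pages * 4096 bytes = 256,000,000,000,000 bytes = 256 TB
assert capacity_bytes == 256 * 10 ** 12
```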

使 B 树可靠

Making B-trees reliable

B 树的基本底层写入操作是用新数据覆盖磁盘上的页面。假设覆盖不会改变页面的位置;即,当该页面被覆盖时,对该页面的所有引用都保持不变。这与 LSM 树等日志结构索引形成鲜明对比,后者仅附加到文件(并最终删除过时的文件),但从不修改文件。

The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. It is assumed that the overwrite does not change the location of the page; i.e., all references to that page remain intact when the page is overwritten. This is in stark contrast to log-structured indexes such as LSM-trees, which only append to files (and eventually delete obsolete files) but never modify files in place.

您可以将覆盖磁盘上的页面视为实际的硬件操作。在磁性硬盘驱动器上,这意味着将磁盘头移动到正确的位置,等待旋转盘片上的正确位置出现,然后用新数据覆盖适当的扇区。在 SSD 上,发生的情况有些复杂,因为 SSD 必须一次擦除和重写存储芯片中相当大的块 [ 19 ]。

You can think of overwriting a page on disk as an actual hardware operation. On a magnetic hard drive, this means moving the disk head to the right place, waiting for the right position on the spinning platter to come around, and then overwriting the appropriate sector with new data. On SSDs, what happens is somewhat more complicated, due to the fact that an SSD must erase and rewrite fairly large blocks of a storage chip at a time [19].

此外,某些操作需要覆盖多个不同的页面。例如,如果由于插入导致页面满而拆分页面,则需要写入拆分的两个页面,并覆盖其父页面以更新对两个子页面的引用。这是一个危险的操作,因为如果数据库在仅写入部分页面后崩溃,最终会得到损坏的索引(例如,可能有一个孤立页面,它不是任何父页面的子页面)。

Moreover, some operations require several different pages to be overwritten. For example, if you split a page because an insertion caused it to be overfull, you need to write the two pages that were split, and also overwrite their parent page to update the references to the two child pages. This is a dangerous operation, because if the database crashes after only some of the pages have been written, you end up with a corrupted index (e.g., there may be an orphan page that is not a child of any parent).

为了使数据库能够抵御崩溃,B 树实现通常在磁盘上包含一个附加数据结构:预写日志(WAL,也称为重做日志)。这是一个仅附加文件,必须先将每个 B 树修改写入该文件,然后才能将其应用于树本身的页面。当数据库在崩溃后恢复时,此日志用于将 B 树恢复到一致状态 [ 5 , 20 ]。

In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after a crash, this log is used to restore the B-tree back to a consistent state [5, 20].

就地更新页面的另一个复杂之处是,如果多个线程要同时访问 B 树,则需要仔细的并发控制,否则线程可能会看到树处于不一致的状态。这通常是通过使用锁存器(轻量级锁)保护树的数据结构来完成的。日志结构的方法在这方面更简单,因为它们在后台进行所有合并,而不会干扰传入的查询,并且时不时以原子方式将旧段交换为新段。

An additional complication of updating pages in place is that careful concurrency control is required if multiple threads are going to access the B-tree at the same time—otherwise a thread may see the tree in an inconsistent state. This is typically done by protecting the tree’s data structures with latches (lightweight locks). Log-structured approaches are simpler in this regard, because they do all the merging in the background without interfering with incoming queries and atomically swap old segments for new segments from time to time.

B 树优化

B-tree optimizations

由于 B 树已经存在了很长时间,因此多年来开发出许多优化也就不足为奇了。仅举几例:

As B-trees have been around for so long, it’s not surprising that many optimizations have been developed over the years. To mention just a few:

  • 一些数据库(如 LMDB)使用写时复制方案 [ 21 ],而不是覆盖页面并维护 WAL 以进行崩溃恢复。修改后的页面将写入不同的位置,并创建树中父页面的新版本,指向新位置。这种方法对于并发控制也很有用,正如我们将在“快照隔离和可重复读取”中看到的那样。

  • Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB) use a copy-on-write scheme [21]. A modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location. This approach is also useful for concurrency control, as we shall see in “Snapshot Isolation and Repeatable Read”.

  • 我们可以通过不存储整个键、而是对键进行缩写来节省页面空间。特别是在树内部的页面中,键只需提供足够的信息来充当键范围之间的边界即可。在一个页面中塞入更多的键,可以让树具有更高的分支因子,从而减少层数。

  • We can save space in pages by not storing the entire key, but abbreviating it. Especially in pages on the interior of the tree, keys only need to provide enough information to act as boundaries between key ranges. Packing more keys into a page allows the tree to have a higher branching factor, and thus fewer levels.

  • 一般来说,页面可以放置在磁盘上的任何位置;没有什么要求具有附近键范围的页面位于磁盘上附近。如果查询需要按排序顺序扫描键范围的大部分,则逐页布局可能效率低下,因为读取的每个页面可能都需要磁盘查找。因此,许多 B 树实现尝试对树进行布局,以便叶页按顺序出现在磁盘上。然而,随着树的生长,很难维持这种顺序。相比之下,由于 LSM 树在合并过程中一次性重写了大部分存储,因此它们更容易在磁盘上保持顺序键彼此靠近。

  • In general, pages can be positioned anywhere on disk; there is nothing requiring pages with nearby key ranges to be nearby on disk. If a query needs to scan over a large part of the key range in sorted order, that page-by-page layout can be inefficient, because a disk seek may be required for every page that is read. Many B-tree implementations therefore try to lay out the tree so that leaf pages appear in sequential order on disk. However, it’s difficult to maintain that order as the tree grows. By contrast, since LSM-trees rewrite large segments of the storage in one go during merging, it’s easier for them to keep sequential keys close to each other on disk.

  • 其他指针已添加到树中。例如,每个叶页面可以引用其左侧和右侧的兄弟页面,这允许按顺序扫描键而无需跳回父页面。

  • Additional pointers have been added to the tree. For example, each leaf page may have references to its sibling pages to the left and right, which allows scanning keys in order without jumping back to parent pages.

  • B 树变体,例如分形树 [ 22 ],借用了一些日志结构的思想来减少磁盘寻道(它们与分形无关)。

  • B-tree variants such as fractal trees [22] borrow some log-structured ideas to reduce disk seeks (and they have nothing to do with fractals).

比较 B 树和 LSM 树

Comparing B-Trees and LSM-Trees

尽管 B 树实现通常比 LSM 树实现更成熟,但 LSM 树由于其性能特征也很有趣。根据经验,LSM 树通常写入速度更快,而 B 树被认为读取速度更快 [ 23 ]。LSM 树上的读取通常较慢,因为它们必须在不同的压缩阶段检查几种不同的数据结构和 SSTable。

Even though B-tree implementations are generally more mature than LSM-tree implementations, LSM-trees are also interesting due to their performance characteristics. As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads [23]. Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction.

然而,基准测试通常是不确定的,并且对工作负载的细节很敏感。您需要使用特定的工作负载测试系统,以便进行有效的比较。在本节中,我们将简要讨论测量存储引擎性能时值得考虑的一些事项。

However, benchmarks are often inconclusive and sensitive to details of the workload. You need to test systems with your particular workload in order to make a valid comparison. In this section we will briefly discuss a few things that are worth considering when measuring the performance of a storage engine.

LSM树的优点

Advantages of LSM-trees

B 树索引必须将每条数据至少写入两次:一次写入预写日志,一次写入树页面本身(也许在页面拆分时再次写入)。即使该页面中只有几个字节发生变化,也必须一次写入整个页面,从而产生开销。一些存储引擎甚至会覆盖同一页面两次,以避免在发生电源故障时最终得到部分更新的页面 [ 24 , 25 ]。

A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once to the tree page itself (and perhaps again as pages are split). There is also overhead from having to write an entire page at a time, even if only a few bytes in that page changed. Some storage engines even overwrite the same page twice in order to avoid ending up with a partially updated page in the event of a power failure [24, 25].

由于SSTables的重复压缩和合并,日志结构索引也会多次重写数据。这种效应(对数据库的一次写入会导致在数据库的生命周期内对磁盘进行多次写入)称为写入放大。对于 SSD 来说,这一点尤其令人担忧,因为 SSD 在磨损之前只能覆盖有限次数的块。

Log-structured indexes also rewrite data multiple times due to repeated compaction and merging of SSTables. This effect—one write to the database resulting in multiple writes to the disk over the course of the database’s lifetime—is known as write amplification. It is of particular concern on SSDs, which can only overwrite blocks a limited number of times before wearing out.

在写入密集型应用程序中,性能瓶颈可能是数据库写入磁盘的速率。在这种情况下,写放大会产生直接的性能成本:存储引擎向磁盘写入的数据越多,在可用磁盘带宽内每秒可以处理的写入次数就越少。

In write-heavy applications, the performance bottleneck might be the rate at which the database can write to disk. In this case, write amplification has a direct performance cost: the more that a storage engine writes to disk, the fewer writes per second it can handle within the available disk bandwidth.

此外,LSM 树通常能够维持比 B 树更高的写入吞吐量,部分原因是它们有时具有较低的写入放大(尽管这取决于存储引擎配置和工作负载),部分原因是它们顺序写入紧凑的 SSTable 文件而不是必须覆盖树中的几个页面[ 26 ]。这种差异对于磁性硬盘驱动器尤其重要,其中顺序写入比随机写入快得多。

Moreover, LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification (although this depends on the storage engine configuration and workload), and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree [26]. This difference is particularly important on magnetic hard drives, where sequential writes are much faster than random writes.

LSM 树可以更好地压缩,因此通常在磁盘上生成比 B 树更小的文件。B 树存储引擎由于碎片而留下一些未使用的磁盘空间:当页面被分割或行无法放入现有页面时,页面中的一些空间仍然未使用。由于 LSM 树不是面向页面的,并且会定期重写 SSTable 以消除碎片,因此它们的存储开销较低,特别是在使用分层压缩时[ 27 ]。

LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees. B-tree storage engines leave some disk space unused due to fragmentation: when a page is split or when a row cannot fit into an existing page, some space in a page remains unused. Since LSM-trees are not page-oriented and periodically rewrite SSTables to remove fragmentation, they have lower storage overheads, especially when using leveled compaction [27].

在许多SSD上,固件内部使用日志结构算法将随机写入转换为底层存储芯片上的顺序写入,因此存储引擎写入模式的影响不太明显[19 ]。然而,较低的写入放大和减少的碎片对于 SSD 仍然是有利的:更紧凑地表示数据,允许在可用 I/O 带宽内发出更多的读写请求。

On many SSDs, the firmware internally uses a log-structured algorithm to turn random writes into sequential writes on the underlying storage chips, so the impact of the storage engine’s write pattern is less pronounced [19]. However, lower write amplification and reduced fragmentation are still advantageous on SSDs: representing data more compactly allows more read and write requests within the available I/O bandwidth.

LSM 树的缺点

Downsides of LSM-trees

日志结构存储的一个缺点是压缩过程有时会干扰正在进行的读取和写入的性能。尽管存储引擎尝试增量执行压缩且不影响并发访问,但磁盘资源有限,因此很容易发生请求需要等待磁盘完成昂贵的压缩操作的情况。对吞吐量和平均响应时间的影响通常很小,但在较高的百分位(请参阅“描述性能”)上,对日志结构存储引擎的查询响应时间有时可能会相当高,而 B 树的表现则更具可预测性[ 28 ]。

A downside of log-structured storage is that the compaction process can sometimes interfere with the performance of ongoing reads and writes. Even though storage engines try to perform compaction incrementally and without affecting concurrent access, disks have limited resources, so it can easily happen that a request needs to wait while the disk finishes an expensive compaction operation. The impact on throughput and average response time is usually small, but at higher percentiles (see “Describing Performance”) the response time of queries to log-structured storage engines can sometimes be quite high, and B-trees can be more predictable [28].

高写入吞吐量时会出现压缩的另一个问题:磁盘有限的写入带宽需要在初始写入(将 memtable 写入日志并刷新到磁盘)和后台运行的压缩线程之间共享。当写入空数据库时,初始写入可以使用全部磁盘带宽,但数据库越大,压缩所需的磁盘带宽就越多。

Another issue with compaction arises at high write throughput: the disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background. When writing to an empty database, the full disk bandwidth can be used for the initial write, but the bigger the database gets, the more disk bandwidth is required for compaction.

如果写入吞吐量很高并且没有仔细配置压缩,则可能会发生压缩无法跟上传入写入速率的情况。在这种情况下,磁盘上未合并的段的数量不断增长,直到磁盘空间耗尽,并且读取也会变慢,因为它们需要检查更多的段文件。通常,基于 SSTable 的存储引擎不会限制传入写入的速率,即使压缩无法跟上,因此您需要显式监视来检测这种情况 [ 29 , 30 ]。

If write throughput is high and compaction is not configured carefully, it can happen that compaction cannot keep up with the rate of incoming writes. In this case, the number of unmerged segments on disk keeps growing until you run out of disk space, and reads also slow down because they need to check more segment files. Typically, SSTable-based storage engines do not throttle the rate of incoming writes, even if compaction cannot keep up, so you need explicit monitoring to detect this situation [29, 30].

B 树的优点是每个键仅存在于索引中的一个位置,而日志结构存储引擎可能在不同段中具有同一键的多个副本。这一点使得 B 树对于想要提供强大事务语义的数据库很有吸引力:在许多关系数据库中,事务隔离是使用键范围上的锁来实现的,而在 B 树索引中,这些锁可以直接附加到树上[ 5 ]。在 第七章中,我们将更详细地讨论这一点。

An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. This aspect makes B-trees attractive in databases that want to offer strong transactional semantics: in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree [5]. In Chapter 7 we will discuss this point in more detail.

B 树在数据库架构中根深蒂固,可为许多工作负载提供始终如一的良好性能,因此它们不太可能很快消失。在新的数据存储中,日志结构索引变得越来越流行。没有快速简单的规则来确定哪种类型的存储引擎更适合您的用例,因此值得根据经验进行测试。

B-trees are very ingrained in the architecture of databases and provide consistently good performance for many workloads, so it’s unlikely that they will go away anytime soon. In new datastores, log-structured indexes are becoming increasingly popular. There is no quick and easy rule for determining which type of storage engine is better for your use case, so it is worth testing empirically.

其他索引结构

Other Indexing Structures

到目前为止我们只讨论了键值索引,它就像关系模型中的主键索引。主键唯一标识关系表中的一行、文档数据库中的一个文档或图形数据库中的一个顶点。数据库中的其他记录可以通过其主键(或ID)引用该行/文档/顶点,并且索引用于解析此类引用。

So far we have only discussed key-value indexes, which are like a primary key index in the relational model. A primary key uniquely identifies one row in a relational table, or one document in a document database, or one vertex in a graph database. Other records in the database can refer to that row/document/vertex by its primary key (or ID), and the index is used to resolve such references.

二级索引也很常见。在关系数据库中,您可以使用 CREATE INDEX 命令在同一个表上创建多个二级索引,它们通常对于高效执行联接至关重要。例如,在第 2 章的图 2-1 中,您很可能在 user_id 列上有二级索引,以便可以在每个表中找到属于同一用户的所有行。

It is also very common to have secondary indexes. In relational databases, you can create several secondary indexes on the same table using the CREATE INDEX command, and they are often crucial for performing joins efficiently. For example, in Figure 2-1 in Chapter 2 you would most likely have a secondary index on the user_id columns so that you can find all the rows belonging to the same user in each of the tables.
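As a minimal sketch of the idea, using Python's built-in sqlite3 module (the orders table and its columns are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders (user_id, amount) VALUES (?, ?)",
                 [(1, 9.99), (2, 5.00), (1, 12.50)])

# A secondary index on user_id; the key (user_id) is not unique,
# so many rows can share the same index entry.
conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")

rows = conn.execute(
    "SELECT id, amount FROM orders WHERE user_id = ?", (1,)).fetchall()
print(rows)  # [(1, 9.99), (3, 12.5)] -- all orders belonging to user 1
```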

二级索引可以轻松地从键值索引构建。主要区别在于键不是唯一的;即,可能有许多行(文档、顶点)具有相同的键。这可以通过两种方式解决:要么使索引中的每个值成为匹配行标识符的列表(如全文索引中的倒排列表),要么通过向每个键附加行标识符来使每个键唯一。无论哪种方式,B 树和日志结构索引都可以用作二级索引。

A secondary index can easily be constructed from a key-value index. The main difference is that keys are not unique; i.e., there might be many rows (documents, vertices) with the same key. This can be solved in two ways: either by making each value in the index a list of matching row identifiers (like a postings list in a full-text index) or by making each key unique by appending a row identifier to it. Either way, both B-trees and log-structured indexes can be used as secondary indexes.
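The two approaches can be sketched in a few lines of Python (the row data is made up for illustration):

```python
# Two ways to represent a non-unique secondary index, sketched with dicts.

rows = {1: "alice", 2: "bob", 3: "alice"}  # row_id -> indexed value

# Option 1: each index value is a list of matching row IDs
# (like a postings list in a full-text index).
postings_index = {}
for row_id, value in rows.items():
    postings_index.setdefault(value, []).append(row_id)

# Option 2: make each key unique by appending the row ID to it.
unique_key_index = {(value, row_id): row_id for row_id, value in rows.items()}

print(postings_index["alice"])  # [1, 3]
# A range scan over keys starting with "alice" finds the same rows:
matches = [rid for (v, rid) in sorted(unique_key_index) if v == "alice"]
print(matches)                  # [1, 3]
```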

在索引中存储值

Storing values within the index

索引中的键是查询搜索的内容,但值可以是以下两种内容之一:它可以是相关的实际行(文档、顶点),也可以是对存储在其他地方的行的引用。在后一种情况下,存储行的地方称为堆文件(heap file),它以不特定的顺序存储数据(它可能是仅追加的,也可能会跟踪已删除的行,以便之后用新数据覆盖它们)。堆文件方法很常见,因为它避免了存在多个二级索引时的数据重复:每个索引仅引用堆文件中的一个位置,实际数据保存在一个地方。

The key in an index is the thing that queries search for, but the value can be one of two things: it could be the actual row (document, vertex) in question, or it could be a reference to the row stored elsewhere. In the latter case, the place where rows are stored is known as a heap file, and it stores data in no particular order (it may be append-only, or it may keep track of deleted rows in order to overwrite them with new data later). The heap file approach is common because it avoids duplicating data when multiple secondary indexes are present: each index just references a location in the heap file, and the actual data is kept in one place.

当更新值而不更改键时,堆文件方法可能非常有效:只要新值不大于旧值,就可以就地覆盖记录。如果新值较大,情况会更复杂,因为它可能需要移动到堆中有足够空间的新位置。在这种情况下,要么需要更新所有索引以指向记录的新堆位置,要么将转发指针留在旧堆位置[ 5 ]。

When updating a value without changing the key, the heap file approach can be quite efficient: the record can be overwritten in place, provided that the new value is not larger than the old value. The situation is more complicated if the new value is larger, as it probably needs to be moved to a new location in the heap where there is enough space. In that case, either all indexes need to be updated to point at the new heap location of the record, or a forwarding pointer is left behind in the old heap location [5].
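A minimal sketch of the forwarding-pointer approach, with a Python list standing in for the heap file (slot numbers play the role of heap locations; this is an illustration, not how any particular database lays out its heap):

```python
heap = []  # each entry: ("data", value) or ("fwd", new_slot)

def insert(value):
    heap.append(("data", value))
    return len(heap) - 1  # the slot number is what indexes store

def read(slot):
    kind, payload = heap[slot]
    while kind == "fwd":           # follow forwarding pointers
        kind, payload = heap[payload]
    return payload

def update(slot, new_value):
    kind, old_value = heap[slot]
    if len(new_value) <= len(old_value):
        heap[slot] = ("data", new_value)  # fits: overwrite in place
    else:
        new_slot = insert(new_value)      # move to a bigger location...
        heap[slot] = ("fwd", new_slot)    # ...leaving a forwarding pointer

s = insert("hi")
update(s, "a much longer value")
# Indexes still point at slot s; reads transparently follow the pointer.
print(read(s))  # a much longer value
```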

在某些情况下,从索引到堆文件的额外跳转对于读取来说性能损失太大,因此可能需要将被索引的行直接存储在索引中。这称为聚集索引。例如,在 MySQL 的 InnoDB 存储引擎中,表的主键始终是聚集索引,二级索引引用主键(而不是堆文件位置)[ 31 ]。在 SQL Server 中,您可以为每个表指定一个聚集索引[ 32 ]。

In some situations, the extra hop from the index to the heap file is too much of a performance penalty for reads, so it can be desirable to store the indexed row directly within an index. This is known as a clustered index. For example, in MySQL’s InnoDB storage engine, the primary key of a table is always a clustered index, and secondary indexes refer to the primary key (rather than a heap file location) [31]. In SQL Server, you can specify one clustered index per table [32].

聚集索引(在索引中存储所有行数据)和非聚集索引(在索引中仅存储对数据的引用)之间的一种折衷称为覆盖索引(covering index)或包含列的索引(index with included columns),它在索引中存储表的部分列[ 33 ]。这允许仅使用索引来回答某些查询(在这种情况下,称索引覆盖了该查询)[ 32 ]。

A compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index) is known as a covering index or index with included columns, which stores some of a table’s columns within the index [33]. This allows some queries to be answered by using the index alone (in which case, the index is said to cover the query) [32].
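SQLite has no INCLUDE syntax, but a multi-column index can cover a query in the same way; a sketch using Python's built-in sqlite3 module (the users table is hypothetical, and the exact wording of the query plan may vary between SQLite versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT, bio TEXT)")
conn.executemany("INSERT INTO users (user_id, name, bio) VALUES (?, ?, ?)",
                 [(1, "Alice", "long bio..."), (2, "Bob", "another bio...")])

# An index on (user_id, name) keeps a copy of name inside the index,
# so the query below can be answered from the index alone, without
# touching the table's own storage.
conn.execute("CREATE INDEX idx_user_name ON users (user_id, name)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE user_id = 2"
).fetchone()[3]
print(plan)  # SQLite reports that it is using a COVERING INDEX
```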

与任何类型的数据重复一样,聚集索引和覆盖索引可以加快读取速度,但它们需要额外的存储空间,并且会增加写入开销。数据库还需要付出额外的努力来强制执行事务保证,因为应用程序不应该看到由于重复而导致的不一致。

As with any kind of duplication of data, clustered and covering indexes can speed up reads, but they require additional storage and can add overhead on writes. Databases also need to go to additional effort to enforce transactional guarantees, because applications should not see inconsistencies due to the duplication.

多列索引

Multi-column indexes

到目前为止讨论的索引仅将单个键映射到一个值。如果我们需要同时查询表的多个列(或文档中的多个字段),这还不够。

The indexes discussed so far only map a single key to a value. That is not sufficient if we need to query multiple columns of a table (or multiple fields in a document) simultaneously.

最常见的多列索引类型称为串联索引,它只是将多个字段组合成一个键,方法是将一列附加到另一列之后(索引定义指定字段的串联顺序)。这就像老式的纸质电话簿,它提供从(姓氏, 名字)到电话号码的索引。由于排序顺序,该索引可用于查找具有特定姓氏的所有人,或具有特定姓氏-名字组合的所有人。但是,如果您想查找具有特定名字的所有人,则该索引毫无用处。

The most common type of multi-column index is called a concatenated index, which simply combines several fields into one key by appending one column to another (the index definition specifies in which order the fields are concatenated). This is like an old-fashioned paper phone book, which provides an index from (lastname, firstname) to phone number. Due to the sort order, the index can be used to find all the people with a particular last name, or all the people with a particular lastname-firstname combination. However, the index is useless if you want to find all the people with a particular first name.
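The phone-book analogy can be sketched with a sorted list of (lastname, firstname) keys and binary search (the names are made up):

```python
import bisect

# A concatenated index on (lastname, firstname), sketched as a sorted list.
phone_book = sorted([
    (("Smith", "Alice"), "555-0001"),
    (("Smith", "Bob"),   "555-0002"),
    (("Jones", "Carol"), "555-0003"),
])
keys = [k for k, _ in phone_book]

# Efficient: all people with a particular last name (a prefix of the key).
lo = bisect.bisect_left(keys, ("Smith", ""))
hi = bisect.bisect_right(keys, ("Smith", "\uffff"))
print(phone_book[lo:hi])  # both Smith entries, found without scanning

# Useless for first names: entries with firstname "Bob" are not
# contiguous in the sort order, so finding them requires a full scan.
bobs = [entry for entry in phone_book if entry[0][1] == "Bob"]
print(bobs)
```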

多维索引是一种更通用的同时查询多个列的方法,这对于地理空间数据尤其重要。例如,餐馆搜索网站可能有一个包含每个餐馆的纬度和经度的数据库。当用户在地图上查看餐馆时,网站需要搜索用户当前正在查看的矩形地图区域内的所有餐馆。这需要一个二维范围查询,如下所示:

Multi-dimensional indexes are a more general way of querying several columns at once, which is particularly important for geospatial data. For example, a restaurant-search website may have a database containing the latitude and longitude of each restaurant. When a user is looking at the restaurants on a map, the website needs to search for all the restaurants within the rectangular map area that the user is currently viewing. This requires a two-dimensional range query like the following:

SELECT * FROM restaurants WHERE latitude  > 51.4946 AND latitude  < 51.5079
                            AND longitude > -0.1162 AND longitude < -0.1004;

标准的 B 树或 LSM 树索引无法高效地回答此类查询:它可以为您提供某个纬度范围内的所有餐厅(但在任意经度),或者某个经度范围内的所有餐厅(但位于北极和南极之间的任何地方),但不能同时满足两个条件。

A standard B-tree or LSM-tree index is not able to answer that kind of query efficiently: it can give you either all the restaurants in a range of latitudes (but at any longitude), or all the restaurants in a range of longitudes (but anywhere between the North and South poles), but not both simultaneously.

一种选择是使用空间填充曲线将二维位置转换为单个数字,然后使用常规 B 树索引 [ 34 ]。更常见的是,使用专门的空间索引,例如 R 树。例如,PostGIS 使用 PostgreSQL 的通用搜索树索引工具将地理空间索引实现为 R 树 [ 35 ]。我们在这里没有篇幅详细描述 R 树,但有大量关于它们的文献。

One option is to translate a two-dimensional location into a single number using a space-filling curve, and then to use a regular B-tree index [34]. More commonly, specialized spatial indexes such as R-trees are used. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL’s Generalized Search Tree indexing facility [35]. We don’t have space to describe R-trees in detail here, but there is plenty of literature on them.
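As an illustration of the space-filling-curve option, here is a minimal Z-order (Morton) curve: quantize each coordinate and interleave the bits, giving a single integer that a regular B-tree could index (the 16-bit resolution is an arbitrary choice for this sketch):

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Z-order (Morton) code: interleave the bits of x and y into one integer."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y at odd positions
    return z

def morton_key(latitude: float, longitude: float) -> int:
    # Quantize each coordinate to a 16-bit integer, then interleave.
    x = int((latitude + 90) / 180 * 0xFFFF)
    y = int((longitude + 180) / 360 * 0xFFFF)
    return interleave_bits(x, y)

# Nearby points usually get nearby keys, so a regular B-tree index on the
# Morton key can serve approximate two-dimensional range queries.
k1 = morton_key(51.5007, -0.1246)    # central London
k2 = morton_key(51.5014, -0.1419)    # also central London
k3 = morton_key(-33.8568, 151.2153)  # Sydney, far away
print(abs(k1 - k2) < abs(k1 - k3))   # True
```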

一个有趣的想法是,多维索引不仅仅适用于地理位置。例如,在电子商务网站上,您可以使用 (红色, 绿色, 蓝色) 三个维度上的三维索引来搜索特定颜色范围内的产品;在天气观测数据库中,您可以使用 (日期, 温度) 上的二维索引,以便高效地搜索 2013 年温度在 25 至 30℃ 之间的所有观测结果。使用一维索引,您必须要么扫描 2013 年的所有记录(无论温度如何)然后按温度过滤,要么反过来。二维索引可以同时按时间戳和温度缩小范围。HyperDex [ 36 ] 使用了该技术。

An interesting idea is that multi-dimensional indexes are not just for geographic locations. For example, on an ecommerce website you could use a three-dimensional index on the dimensions (red, green, blue) to search for products in a certain range of colors, or in a database of weather observations you could have a two-dimensional index on (date, temperature) in order to efficiently search for all the observations during the year 2013 where the temperature was between 25 and 30℃. With a one-dimensional index, you would have to either scan over all the records from 2013 (regardless of temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by timestamp and temperature simultaneously. This technique is used by HyperDex [36].

全文搜索和模糊索引

Full-text search and fuzzy indexes

到目前为止讨论的所有索引都假设您拥有精确的数据,并允许您查询键的精确值或具有排序顺序的键的一系列值。他们不允许您搜索相似的键,例如拼写错误的单词。这种模糊查询需要不同的技术。

All the indexes discussed so far assume that you have exact data and allow you to query for exact values of a key, or a range of values of a key with a sort order. What they don’t allow you to do is search for similar keys, such as misspelled words. Such fuzzy querying requires different techniques.

例如,全文搜索引擎通常允许将对一个单词的搜索扩展为包括该单词的同义词、忽略单词的语法变体、搜索在同一文档中彼此邻近出现的单词,并支持各种依赖于文本语言分析的其他特性。为了应对文档或查询中的拼写错误,Lucene 能够在文本中搜索一定编辑距离内的单词(编辑距离为 1 意味着添加、删除或替换了一个字母)[ 37 ]。

For example, full-text search engines commonly allow a search for one word to be expanded to include synonyms of the word, to ignore grammatical variations of words, and to search for occurrences of words near each other in the same document, and support various other features that depend on linguistic analysis of the text. To cope with typos in documents or queries, Lucene is able to search text for words within a certain edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced) [37].
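Edit distance itself is straightforward to compute with dynamic programming; a sketch (this is the textbook algorithm, not Lucene's automaton-based implementation):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if they match)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("lucene", "lucine"))     # 1: one letter replaced
print(edit_distance("search", "searching"))  # 3: three letters added
```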

正如“用 SSTable 制作 LSM 树” 中提到的,Lucene 使用类似 SSTable 的结构作为其术语字典。此结构需要一个小的内存索引,该索引告诉查询需要在排序文件中的哪个偏移量处查找键。在 LevelDB 中,内存索引是一些键的稀疏集合,但在 Lucene 中,内存索引是键中字符的有限状态自动机,类似于 trie [ 38 ]。该自动机可以转换为Levenshtein 自动机,它支持在给定编辑距离内有效搜索单词[ 39 ]。

As mentioned in “Making an LSM-tree out of SSTables”, Lucene uses a SSTable-like structure for its term dictionary. This structure requires a small in-memory index that tells queries at which offset in the sorted file they need to look for a key. In LevelDB, this in-memory index is a sparse collection of some of the keys, but in Lucene, the in-memory index is a finite state automaton over the characters in the keys, similar to a trie [38]. This automaton can be transformed into a Levenshtein automaton, which supports efficient search for words within a given edit distance [39].

其他模糊搜索技术朝着文档分类和机器学习的方向发展。有关更多详细信息,请参阅信息检索教科书[例如,40 ]。

Other fuzzy search techniques go in the direction of document classification and machine learning. See an information retrieval textbook for more detail [e.g., 40].

将所有数据保存在内存中

Keeping everything in memory

本章到目前为止讨论的数据结构都是对磁盘限制的回应。与主存相比,磁盘处理起来很麻烦。无论是磁性磁盘还是 SSD,如果您希望获得良好的读写性能,都需要仔细安排数据在磁盘上的布局。然而,我们容忍这种麻烦,因为磁盘有两个显著的优点:它们是持久的(断电后其内容不会丢失),并且每 GB 成本低于 RAM。

The data structures discussed so far in this chapter have all been answers to the limitations of disks. Compared to main memory, disks are awkward to deal with. With both magnetic disks and SSDs, data on disk needs to be laid out carefully if you want good performance on reads and writes. However, we tolerate this awkwardness because disks have two significant advantages: they are durable (their contents are not lost if the power is turned off), and they have a lower cost per gigabyte than RAM.

随着 RAM 变得越来越便宜,每 GB 成本的争论被削弱了。许多数据集根本就没有那么大,因此将它们完全保存在内存中是非常可行的,并且可能分布在多台机器上。这导致了内存数据库的发展。

As RAM becomes cheaper, the cost-per-gigabyte argument is eroded. Many datasets are simply not that big, so it’s quite feasible to keep them entirely in memory, potentially distributed across several machines. This has led to the development of in-memory databases.

某些内存中键值存储(例如 Memcached)仅用于缓存,机器重启导致数据丢失是可以接受的。但其他内存数据库的目标是持久性,这可以通过特殊硬件(例如电池供电的 RAM)、将更改日志写入磁盘、将定期快照写入磁盘,或将内存状态复制到其他机器来实现。

Some in-memory key-value stores, such as Memcached, are intended for caching use only, where it’s acceptable for data to be lost if a machine is restarted. But other in-memory databases aim for durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state to other machines.

当内存数据库重新启动时,它需要从磁盘或通过网络从副本重新加载其状态(除非使用特殊硬件)。尽管写入磁盘,它仍然是一个内存数据库,因为磁盘仅用作持久性的仅附加日志,并且读取完全从内存提供。写入磁盘还具有操作优势:外部实用程序可以轻松备份、检查和分析磁盘上的文件。

When an in-memory database is restarted, it needs to reload its state, either from disk or over the network from a replica (unless special hardware is used). Despite writing to disk, it’s still an in-memory database, because the disk is merely used as an append-only log for durability, and reads are served entirely from memory. Writing to disk also has operational advantages: files on disk can easily be backed up, inspected, and analyzed by external utilities.
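A toy illustration of this arrangement: a dict in memory, with an append-only log used only for durability and recovery (JSON lines as the log format is an arbitrary choice for this sketch):

```python
import json
import os
import tempfile

class InMemoryStore:
    """Minimal sketch: an in-memory dict, made durable by an
    append-only log on disk. Reads never touch the disk."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        if os.path.exists(log_path):  # reload state after a restart
            with open(log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value

    def set(self, key, value):
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")  # durability
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)  # served entirely from memory

log = os.path.join(tempfile.mkdtemp(), "store.log")
store = InMemoryStore(log)
store.set("color", "red")
del store                   # simulate a crash/restart...
store = InMemoryStore(log)  # ...and recover state from the log
print(store.get("color"))   # red
```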

VoltDB、MemSQL 和 Oracle TimesTen 等产品是具有关系模型的内存数据库,供应商声称它们可以通过消除与管理磁盘数据结构相关的所有开销来提供巨大的性能改进[ 41, 42 ]。RAMCloud 是一种开源的、具有持久性的内存中键值存储(对内存中的数据以及磁盘上的数据都使用日志结构方法)[ 43 ]。Redis 和 Couchbase 通过异步写入磁盘来提供较弱的持久性。

Products such as VoltDB, MemSQL, and Oracle TimesTen are in-memory databases with a relational model, and the vendors claim that they can offer big performance improvements by removing all the overheads associated with managing on-disk data structures [41, 42]. RAMCloud is an open source, in-memory key-value store with durability (using a log-structured approach for the data in memory as well as the data on disk) [43]. Redis and Couchbase provide weak durability by writing to disk asynchronously.

与直觉相反,内存数据库的性能优势并不是因为它们不需要从磁盘读取。如果您有足够的内存,即使基于磁盘的存储引擎也可能永远不需要从磁盘读取数据,因为操作系统无论如何都会将最近使用的磁盘块缓存在内存中。相反,它们可以更快,因为它们可以避免以可写入磁盘的形式编码内存中数据结构的开销[ 44 ]。

Counterintuitively, the performance advantage of in-memory databases is not due to the fact that they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk if you have enough memory, because the operating system caches recently used disk blocks in memory anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk [44].

除了性能之外,内存数据库的另一个有趣领域是提供难以使用基于磁盘的索引实现的数据模型。例如,Redis 为各种数据结构(例如优先级队列和集合)提供类似数据库的接口。由于它将所有数据保存在内存中,因此其实现相对简单。

Besides performance, another interesting area for in-memory databases is providing data models that are difficult to implement with disk-based indexes. For example, Redis offers a database-like interface to various data structures such as priority queues and sets. Because it keeps all data in memory, its implementation is comparatively simple.

最近的研究表明,内存数据库架构可以扩展到支持大于可用内存的数据集,而不会带来以磁盘为中心的架构的开销[ 45 ]。所谓的反缓存方法的工作原理是,当内存不足时,将最近最少使用的数据从内存移出到磁盘,并在将来再次访问时将其加载回内存。这类似于操作系统对虚拟内存和交换文件的处理,但数据库可以比操作系统更有效地管理内存,因为它可以以单个记录的粒度而不是整个内存页的粒度工作。不过,这种方法仍然需要索引完全适合内存(就像本章开头的 Bitcask 示例)。

Recent research indicates that an in-memory database architecture could be extended to support datasets larger than the available memory, without bringing back the overheads of a disk-centric architecture [45]. The so-called anti-caching approach works by evicting the least recently used data from memory to disk when there is not enough memory, and loading it back into memory when it is accessed again in the future. This is similar to what operating systems do with virtual memory and swap files, but the database can manage memory more efficiently than the OS, as it can work at the granularity of individual records rather than entire memory pages. This approach still requires indexes to fit entirely in memory, though (like the Bitcask example at the beginning of the chapter).
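The eviction/reload cycle can be sketched with an OrderedDict as the LRU structure and a plain dict standing in for disk (a toy model of the idea, not how any anti-caching system is actually implemented):

```python
from collections import OrderedDict

class AntiCachingStore:
    """Sketch of anti-caching: when memory is full, evict the least
    recently used record to 'disk' (a dict standing in for one), and
    load it back on access. Eviction is per record, not per page."""

    def __init__(self, memory_capacity):
        self.capacity = memory_capacity
        self.memory = OrderedDict()  # key -> value, in LRU order
        self.disk = {}

    def get(self, key):
        if key not in self.memory:
            self.memory[key] = self.disk.pop(key)  # load back into memory
            self._evict_if_needed()
        self.memory.move_to_end(key)  # mark as most recently used
        return self.memory[key]

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        self._evict_if_needed()

    def _evict_if_needed(self):
        while len(self.memory) > self.capacity:
            old_key, old_value = self.memory.popitem(last=False)
            self.disk[old_key] = old_value  # evict the LRU record

store = AntiCachingStore(memory_capacity=2)
store.put("a", 1); store.put("b", 2); store.put("c", 3)
print(sorted(store.disk))  # ['a'] -- the LRU record was evicted
print(store.get("a"))      # 1 -- transparently loaded back from 'disk'
```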

如果非易失性存储器(NVM)技术得到更广泛的采用,可能需要对存储引擎设计进行进一步的改变[ 46 ]。目前,这是一个新的研究领域,但未来值得关注。

Further changes to storage engine design will probably be needed if non-volatile memory (NVM) technologies become more widely adopted [46]. At present, this is a new area of research, but it is worth keeping an eye on in the future.

事务处理还是分析?

Transaction Processing or Analytics?

在业务数据处理的早期,对数据库的写入通常对应于正在发生的商业交易:进行销售、向供应商下订单、支付员工工资等。随着数据库扩展到不涉及资金易手的领域,事务这个术语仍然保留了下来,指的是构成一个逻辑单元的一组读取和写入。

In the early days of business data processing, a write to the database typically corresponded to a commercial transaction taking place: making a sale, placing an order with a supplier, paying an employee’s salary, etc. As databases expanded into areas that didn’t involve money changing hands, the term transaction nevertheless stuck, referring to a group of reads and writes that form a logical unit.

注意

事务不一定具有 ACID(原子性、一致性、隔离性和持久性)属性。事务处理仅意味着允许客户端进行低延迟读取和写入,而不是仅定期运行(例如每天一次)的批处理作业。我们将在第 7 章中讨论 ACID 属性,并在第 10 章中讨论批处理。

A transaction needn’t necessarily have ACID (atomicity, consistency, isolation, and durability) properties. Transaction processing just means allowing clients to make low-latency reads and writes—as opposed to batch processing jobs, which only run periodically (for example, once per day). We discuss the ACID properties in Chapter 7 and batch processing in Chapter 10.

尽管数据库开始用于许多不同类型的数据(博客文章的评论、游戏中的操作、地址簿中的联系人等),但基本访问模式仍然类似于处理业务事务。应用程序通常使用索引按某个键查找少量记录。根据用户的输入插入或更新记录。由于这些应用程序是交互式的,因此访问模式被称为在线事务处理 (OLTP)。

Even though databases started being used for many different kinds of data—comments on blog posts, actions in a game, contacts in an address book, etc.—the basic access pattern remained similar to processing business transactions. An application typically looks up a small number of records by some key, using an index. Records are inserted or updated based on the user’s input. Because these applications are interactive, the access pattern became known as online transaction processing (OLTP).

然而,数据库也开始越来越多地用于数据分析,其访问模式截然不同。通常,分析查询需要扫描大量记录,仅读取每条记录的几列,并计算聚合统计信息(例如计数、总和或平均值),而不是将原始数据返回给用户。例如,如果您的数据是销售交易表,则分析查询可能是:

However, databases also started being increasingly used for data analytics, which has very different access patterns. Usually an analytic query needs to scan over a huge number of records, only reading a few columns per record, and calculates aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user. For example, if your data is a table of sales transactions, then analytic queries might be:

  • 一月份我们每家商店的总收入是多少?

  • What was the total revenue of each of our stores in January?

  • 在最近的促销活动中,我们比平时多卖了多少根香蕉?

  • How many more bananas than usual did we sell during our latest promotion?

  • 哪个品牌的婴儿食品最常与 X 品牌尿布一起购买?

  • Which brand of baby food is most often purchased together with brand X diapers?
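The first of those questions might translate into an aggregation query of this shape, sketched here against SQLite via Python (the schema and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, month TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Downtown", "2024-01", 1200.0),
    ("Downtown", "2024-01", 800.0),
    ("Airport",  "2024-01", 500.0),
    ("Downtown", "2024-02", 950.0),
])

# Scan many rows, read few columns, return an aggregate rather than
# the raw records -- the typical shape of an OLAP query.
totals = conn.execute("""
    SELECT store, SUM(revenue) FROM sales
    WHERE month = '2024-01'
    GROUP BY store ORDER BY store
""").fetchall()
print(totals)  # [('Airport', 500.0), ('Downtown', 2000.0)]
```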

这些查询通常由业务分析师编写,并输入到报告中,帮助公司管理层做出更好的决策(商业智能)。为了区分这种使用数据库和事务处理的模式,它被称为在线分析处理 (OLAP)[ 47 ]。iv OLTP 和 OLAP 之间的区别并不总是很明显,但表 3-1列出了一些典型特征。

These queries are often written by business analysts, and feed into reports that help the management of a company make better decisions (business intelligence). In order to differentiate this pattern of using databases from transaction processing, it has been called online analytic processing (OLAP) [47].iv The difference between OLTP and OLAP is not always clear-cut, but some typical characteristics are listed in Table 3-1.

表 3-1。比较事务处理与分析系统的特征

| 属性 | 事务处理系统 (OLTP) | 分析系统 (OLAP) |
| --- | --- | --- |
| 主要读取模式 | 每次查询读取少量记录,按键获取 | 在大量记录上聚合 |
| 主要写入模式 | 来自用户输入的随机访问、低延迟写入 | 批量导入 (ETL) 或事件流 |
| 主要使用者 | 最终用户/客户,通过 Web 应用程序 | 内部分析师,用于决策支持 |
| 数据代表什么 | 数据的最新状态(当前时间点) | 随时间发生的事件的历史 |
| 数据集大小 | GB 到 TB | TB 到 PB |

Table 3-1. Comparing characteristics of transaction processing versus analytic systems

| Property | Transaction processing systems (OLTP) | Analytic systems (OLAP) |
| --- | --- | --- |
| Main read pattern | Small number of records per query, fetched by key | Aggregate over large number of records |
| Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream |
| Primarily used by | End user/customer, via web application | Internal analyst, for decision support |
| What data represents | Latest state of data (current point in time) | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |

最初,相同的数据库既用于事务处理也用于分析查询。事实证明,SQL 在这方面非常灵活:它既适用于 OLTP 类型的查询,也适用于 OLAP 类型的查询。然而,在 20 世纪 80 年代末和 90 年代初,出现了一种趋势:公司停止使用其 OLTP 系统进行分析,转而在单独的数据库上运行分析。这个独立的数据库被称为数据仓库。

At first, the same databases were used for both transaction processing and analytic queries. SQL turned out to be quite flexible in this regard: it works well for OLTP-type queries as well as OLAP-type queries. Nevertheless, in the late 1980s and early 1990s, there was a trend for companies to stop using their OLTP systems for analytics purposes, and to run the analytics on a separate database instead. This separate database was called a data warehouse.

数据仓库

Data Warehousing

企业可能拥有数十个不同的事务处理系统:为面向客户的网站提供支持的系统、控制实体店中销售点(结账)系统的系统、跟踪仓库库存、规划车辆路线、管理供应商、管理员工等。每个系统都很复杂,需要一个团队来维护,因此这些系统最终大多彼此独立运行。

An enterprise may have dozens of different transaction processing systems: systems powering the customer-facing website, controlling point of sale (checkout) systems in physical stores, tracking inventory in warehouses, planning routes for vehicles, managing suppliers, administering employees, etc. Each of these systems is complex and needs a team of people to maintain it, so the systems end up operating mostly autonomously from each other.

这些 OLTP 系统通常期望具有高可用性并以低延迟处理事务,因为它们通常对业务运营至关重要。因此,数据库管理员严密保护他们的 OLTP 数据库。他们通常不愿意让业务分析师在 OLTP 数据库上运行即席分析查询,因为这些查询通常成本高昂,需要扫描大部分数据集,这可能会损害并发执行事务的性能。

These OLTP systems are usually expected to be highly available and to process transactions with low latency, since they are often critical to the operation of the business. Database administrators therefore closely guard their OLTP databases. They are usually reluctant to let business analysts run ad hoc analytic queries on an OLTP database, since those queries are often expensive, scanning large parts of the dataset, which can harm the performance of concurrently executing transactions.

相比之下,数据仓库是一个独立的数据库,分析师可以随意查询,而不影响 OLTP 操作 [ 48 ]。数据仓库包含公司所有各种 OLTP 系统中数据的只读副本。数据从 OLTP 数据库中提取(使用定期数据转储或连续更新流),转换为易于分析的模式,进行清理,然后加载到数据仓库中。将数据获取到仓库的这个过程称为 提取-转换-加载(ETL),如图 3-8所示。

A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations [48]. The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the warehouse is known as Extract–Transform–Load (ETL) and is illustrated in Figure 3-8.

图 3-8。ETL 至数据仓库的简化示意图。

Figure 3-8. Simplified outline of ETL into a data warehouse.
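The ETL process can be caricatured in a few lines: extract rows from an operational store, transform them into an analysis-friendly shape, and load them into a warehouse table (all names and fields here are hypothetical):

```python
# Extract: a dump of raw rows from the operational (OLTP) database.
oltp_orders = [
    {"id": 1, "ts": "2024-01-05T10:00:00", "cents": 1299, "store": "Downtown"},
    {"id": 2, "ts": "2024-01-06T11:30:00", "cents": 550,  "store": "Airport"},
]

def transform(row):
    # Clean up and reshape for analysis: split the timestamp into a
    # date, and convert integer cents into dollars.
    date, _, _time = row["ts"].partition("T")
    return {"order_id": row["id"], "date": date,
            "revenue": row["cents"] / 100, "store": row["store"]}

# Load: append the transformed rows into the warehouse's fact table.
warehouse_fact_sales = [transform(r) for r in oltp_orders]
print(warehouse_fact_sales[0]["revenue"])  # 12.99
```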

现在几乎所有大型企业都存在数据仓库,但在小公司却几乎闻所未闻。这可能是因为大多数小公司没有那么多不同的 OLTP 系统,而且大多数小公司的数据量很小,小到可以在传统 SQL 数据库中查询,甚至可以在电子表格中进行分析。在大公司中,需要做很多繁重的工作才能完成在小公司中简单的事情。

Data warehouses now exist in almost all large enterprises, but in small companies they are almost unheard of. This is probably because most small companies don’t have so many different OLTP systems, and most small companies have a small amount of data—small enough that it can be queried in a conventional SQL database, or even analyzed in a spreadsheet. In a large company, a lot of heavy lifting is required to do something that is simple in a small company.

使用单独的数据仓库(而不是直接查询 OLTP 系统进行分析)的一大优势是可以针对分析访问模式优化数据仓库。事实证明,本章前半部分讨论的索引算法非常适合 OLTP,但不太擅长回答分析查询。在本章的其余部分中,我们将讨论针对分析进行优化的存储引擎。

A big advantage of using a separate data warehouse, rather than querying OLTP systems directly for analytics, is that the data warehouse can be optimized for analytic access patterns. It turns out that the indexing algorithms discussed in the first half of this chapter work well for OLTP, but are not very good at answering analytic queries. In the rest of this chapter we will look at storage engines that are optimized for analytics instead.

OLTP数据库和数据仓库之间的差异

The divergence between OLTP databases and data warehouses

数据仓库的数据模型最常见的是关系型,因为 SQL 通常非常适合分析查询。有许多图形化数据分析工具可以生成 SQL 查询、可视化结果,并允许分析师探索数据(通过下钻(drill-down)、切片和切块(slicing and dicing)等操作)。

The data model of a data warehouse is most commonly relational, because SQL is generally a good fit for analytic queries. There are many graphical data analysis tools that generate SQL queries, visualize the results, and allow analysts to explore the data (through operations such as drill-down and slicing and dicing).

从表面上看,数据仓库和关系型OLTP数据库看起来很相似,因为它们都有SQL查询接口。然而,系统的内部结构可能看起来完全不同,因为它们针对非常不同的查询模式进行了优化。许多数据库供应商现在专注于支持事务处理或分析工作负载,但不是同时支持两者。

On the surface, a data warehouse and a relational OLTP database look similar, because they both have a SQL query interface. However, the internals of the systems can look quite different, because they are optimized for very different query patterns. Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both.

某些数据库(例如 Microsoft SQL Server 和 SAP HANA)在同一产品中同时支持事务处理和数据仓库。然而,它们越来越多地成为两个独立的存储和查询引擎,只是恰好可以通过通用的 SQL 接口进行访问[ 49, 50, 51 ]。

Some databases, such as Microsoft SQL Server and SAP HANA, have support for transaction processing and data warehousing in the same product. However, they are increasingly becoming two separate storage and query engines, which happen to be accessible through a common SQL interface [49, 50, 51].

Teradata、Vertica、SAP HANA 和 ParAccel 等数据仓库供应商通常以昂贵的商业许可证出售其系统。Amazon RedShift 是 ParAccel 的托管版本。最近,出现了大量开源 SQL-on-Hadoop 项目;他们很年轻,但目标是与商业数据仓库系统竞争。其中包括 Apache Hive、Spark SQL、Cloudera Impala、Facebook Presto、Apache Tajo 和 Apache Drill [ 52、53 ]。其中一些是基于 Google 的 Dremel [ 54 ] 的想法。

Data warehouse vendors such as Teradata, Vertica, SAP HANA, and ParAccel typically sell their systems under expensive commercial licenses. Amazon RedShift is a hosted version of ParAccel. More recently, a plethora of open source SQL-on-Hadoop projects have emerged; they are young but aiming to compete with commercial data warehouse systems. These include Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, and Apache Drill [52, 53]. Some of them are based on ideas from Google’s Dremel [54].

星型与雪花:用于分析的模式

Stars and Snowflakes: Schemas for Analytics

正如第 2 章中所探讨的,根据应用程序的需求,在事务处理领域中使用了各种不同的数据模型。另一方面,在分析中,数据模型的多样性要少得多。许多数据仓库都以相当公式化的方式使用,称为星型模式(也称为维度建模 [ 55 ])。

As explored in Chapter 2, a wide range of different data models are used in the realm of transaction processing, depending on the needs of the application. On the other hand, in analytics, there is much less diversity of data models. Many data warehouses are used in a fairly formulaic style, known as a star schema (also known as dimensional modeling [55]).

图 3-9 中的示例模式显示了可能在杂货零售商处找到的数据仓库。模式的中心是一个所谓的事实表(在本例中称为 fact_sales)。事实表的每一行代表在特定时间发生的一个事件(这里,每一行代表客户购买某个产品)。如果我们分析的是网站流量而不是零售销售,则每一行可能代表用户的一次页面浏览或一次点击。

The example schema in Figure 3-9 shows a data warehouse that might be found at a grocery retailer. At the center of the schema is a so-called fact table (in this example, it is called fact_sales). Each row of the fact table represents an event that occurred at a particular time (here, each row represents a customer’s purchase of a product). If we were analyzing website traffic rather than retail sales, each row might represent a page view or a click by a user.

图 3-9。用于数据仓库的星型模式示例。

Figure 3-9. Example of a star schema for use in a data warehouse.

通常,事实被捕获为单独的事件,因为这可以为以后的分析提供最大的灵活性。然而,这意味着事实表可能变得非常大。像苹果、沃尔玛或 eBay 这样的大企业,其数据仓库中可能有数十 PB 的交易历史记录,其中大部分都在事实表中[ 56 ]。

Usually, facts are captured as individual events, because this allows maximum flexibility of analysis later. However, this means that the fact table can become extremely large. A big enterprise like Apple, Walmart, or eBay may have tens of petabytes of transaction history in its data warehouse, most of which is in fact tables [56].

事实表中的某些列是属性,例如产品的销售价格以及从供应商处购买产品的成本(从而可以计算利润率)。事实表中的其他列是对其他表的外键引用,这些表称为维度表。由于事实表中的每一行代表一个事件,因此维度代表事件的谁(who)、什么(what)、何地(where)、何时(when)、如何(how)和为何(why)。

Some of the columns in the fact table are attributes, such as the price at which the product was sold and the cost of buying it from the supplier (allowing the profit margin to be calculated). Other columns in the fact table are foreign key references to other tables, called dimension tables. As each row in the fact table represents an event, the dimensions represent the who, what, where, when, how, and why of the event.

例如,在图 3-9 中,维度之一是销售的产品。dim_product 表中的每一行代表一种待售产品,包括其库存单位 (SKU)、描述、品牌名称、类别、脂肪含量、包装尺寸等。fact_sales 表中的每一行使用一个外键来表明在该特定交易中销售了哪种产品。(为简单起见,如果客户一次购买多种不同的产品,它们将在事实表中表示为单独的行。)

For example, in Figure 3-9, one of the dimensions is the product that was sold. Each row in the dim_product table represents one type of product that is for sale, including its stock-keeping unit (SKU), description, brand name, category, fat content, package size, etc. Each row in the fact_sales table uses a foreign key to indicate which product was sold in that particular transaction. (For simplicity, if the customer buys several different products at once, they are represented as separate rows in the fact table.)

Even date and time are often represented using dimension tables, because this allows additional information about dates (such as public holidays) to be encoded, allowing queries to differentiate between sales on holidays and non-holidays.

The name “star schema” comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star.

A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions. For example, there could be separate tables for brands and product categories, and each row in the dim_product table could reference the brand and category as foreign keys, rather than storing them as strings in the dim_product table. Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with [55].

In a typical data warehouse, tables are often very wide: fact tables often have over 100 columns, sometimes several hundred [51]. Dimension tables can also be very wide, as they include all the metadata that may be relevant for analysis—for example, the dim_store table may include details of which services are offered at each store, whether it has an in-store bakery, the square footage, the date when the store was first opened, when it was last remodeled, how far it is from the nearest highway, etc.

Column-Oriented Storage

If you have trillions of rows and petabytes of data in your fact tables, storing and querying them efficiently becomes a challenging problem. Dimension tables are usually much smaller (millions of rows), so in this section we will concentrate primarily on storage of facts.

Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4 or 5 of them at one time ("SELECT *" queries are rarely needed for analytics) [51]. Take the query in Example 3-1: it accesses a large number of rows (every occurrence of someone buying fruit or candy during the 2013 calendar year), but it only needs to access three columns of the fact_sales table: date_key, product_sk, and quantity. The query ignores all other columns.

Example 3-1. Analyzing whether people are more inclined to buy fresh fruit or candy, depending on the day of the week
SELECT
  dim_date.weekday, dim_product.category,
  SUM(fact_sales.quantity) AS quantity_sold
FROM fact_sales
  JOIN dim_date    ON fact_sales.date_key   = dim_date.date_key
  JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
WHERE
  dim_date.year = 2013 AND
  dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
  dim_date.weekday, dim_product.category;

How can we execute this query efficiently?

In most OLTP databases, storage is laid out in a row-oriented fashion: all the values from one row of a table are stored next to each other. Document databases are similar: an entire document is typically stored as one contiguous sequence of bytes. You can see this in the CSV example of Figure 3-1.

In order to process a query like Example 3-1, you may have indexes on fact_sales.date_key and/or fact_sales.product_sk that tell the storage engine where to find all the sales for a particular date or for a particular product. But then, a row-oriented storage engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into memory, parse them, and filter out those that don’t meet the required conditions. That can take a long time.

The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead. If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work. This principle is illustrated in Figure 3-10.
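As a toy sketch of this principle (the table and values are made up for illustration, not taken from any real engine), each column can be held as its own sequence, as if each were a separate file, and a query reads only the sequences it needs:

```python
# Row-oriented layout: all values from one row are stored together.
rows = [
    {"date_key": 140102, "product_sk": 69, "store_sk": 4, "quantity": 1},
    {"date_key": 140102, "product_sk": 74, "store_sk": 3, "quantity": 3},
    {"date_key": 140103, "product_sk": 31, "store_sk": 2, "quantity": 1},
]

# Column-oriented layout: all values from each column are stored together,
# in the same row order, as if each list were a separate column file.
columns = {
    name: [row[name] for row in rows]
    for name in ["date_key", "product_sk", "store_sk", "quantity"]
}

def total_quantity(product_sk):
    """Answer a query by reading only the two columns it actually needs."""
    return sum(
        q
        for sk, q in zip(columns["product_sk"], columns["quantity"])
        if sk == product_sk
    )

def reassemble_row(i):
    """Rebuild row i by taking the i-th entry of every column file."""
    return {name: values[i] for name, values in columns.items()}
```

In a real row store, `total_quantity` would have to load every attribute of every row from disk; here it touches two of the four column lists.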

Note

Column storage is easiest to understand in a relational data model, but it applies equally to nonrelational data. For example, Parquet [57] is a columnar storage format that supports a document data model, based on Google’s Dremel [54].

Figure 3-10. Storing relational data by column, rather than by row.

The column-oriented storage layout relies on each column file containing the rows in the same order. Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the individual column files and put them together to form the 23rd row of the table.

Column Compression

Besides only loading those columns from disk that are required for a query, we can further reduce the demands on disk throughput by compressing data. Fortunately, column-oriented storage often lends itself very well to compression.

Take a look at the sequences of values for each column in Figure 3-10: they often look quite repetitive, which is a good sign for compression. Depending on the data in the column, different compression techniques can be used. One technique that is particularly effective in data warehouses is bitmap encoding, illustrated in Figure 3-11.

Figure 3-11. Compressed, bitmap-indexed storage of a single column.

Often, the number of distinct values in a column is small compared to the number of rows (for example, a retailer may have billions of sales transactions, but only 100,000 distinct products). We can now take a column with n distinct values and turn it into n separate bitmaps: one bitmap for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if not.

If n is very small (for example, a country column may have approximately 200 distinct values), those bitmaps can be stored with one bit per row. But if n is bigger, there will be a lot of zeros in most of the bitmaps (we say that they are sparse). In that case, the bitmaps can additionally be run-length encoded, as shown at the bottom of Figure 3-11. This can make the encoding of a column remarkably compact.
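A minimal sketch of the two encodings (pure Python on a small in-memory column; following Figure 3-11, the run-length format alternates lengths of 0-runs and 1-runs, starting with zeros):

```python
def to_bitmaps(column):
    """One bitmap per distinct value; bit i is 1 iff row i holds that value."""
    return {v: [1 if x == v else 0 for x in column] for v in set(column)}

def run_length_encode(bitmap):
    """Alternating run lengths of 0s and 1s, starting with zeros (a bitmap
    that begins with 1 therefore gets a leading zero-length run of 0s)."""
    runs, current, length = [], 0, 0
    for bit in bitmap:
        if bit == current:
            length += 1
        else:
            runs.append(length)
            current, length = bit, 1
    runs.append(length)
    return runs

# Made-up product_sk column with few distinct values relative to row count.
product_sk = [69, 69, 74, 31, 31, 31, 74, 69]
bitmaps = to_bitmaps(product_sk)
```

For example, the bitmap for value 31 is `[0, 0, 0, 1, 1, 1, 0, 0]`, which run-length encodes to the three numbers `[3, 3, 2]`.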

Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data warehouse. For example:

WHERE product_sk IN (30, 68, 69):

Load the three bitmaps for product_sk = 30, product_sk = 68, and product_sk = 69, and calculate the bitwise OR of the three bitmaps, which can be done very efficiently.

WHERE product_sk = 31 AND store_sk = 3:

Load the bitmaps for product_sk = 31 and store_sk = 3, and calculate the bitwise AND. This works because the columns contain the rows in the same order, so the kth bit in one column’s bitmap corresponds to the same row as the kth bit in another column’s bitmap.
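As a toy sketch (made-up data, and bitmaps kept uncompressed for clarity), both predicates reduce to bitwise operations over whole bitmaps:

```python
def bitmap_or(*bitmaps):
    return [max(bits) for bits in zip(*bitmaps)]

def bitmap_and(a, b):
    return [x & y for x, y in zip(a, b)]

# Two columns of the same (hypothetical) fact table, in the same row order.
product_sk = [30, 68, 69, 31, 30]
store_sk   = [ 3,  1,  3,  3,  2]

product_bm = {v: [1 if x == v else 0 for x in product_sk] for v in set(product_sk)}
store_bm   = {v: [1 if x == v else 0 for x in store_sk] for v in set(store_sk)}

# WHERE product_sk IN (30, 68, 69): bitwise OR of the three bitmaps.
in_rows = bitmap_or(product_bm[30], product_bm[68], product_bm[69])

# WHERE product_sk = 31 AND store_sk = 3: bitwise AND of the two bitmaps.
# This works only because both columns list the rows in the same order.
and_rows = bitmap_and(product_bm[31], store_bm[3])
```

A 1 in the result bitmap marks a row that satisfies the predicate.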

There are also various other compression schemes for different kinds of data, but we won’t go into them in detail—see [58] for an overview.

Column-oriented storage and column families

Cassandra and HBase have a concept of column families, which they inherited from Bigtable [9]. However, it is very misleading to call them column-oriented: within each column family, they store all columns from a row together, along with a row key, and they do not use column compression. Thus, the Bigtable model is still mostly row-oriented.

Memory bandwidth and vectorized processing

For data warehouse queries that need to scan over millions of rows, a big bottleneck is the bandwidth for getting data from disk into memory. However, that is not the only bottleneck. Developers of analytical databases also worry about efficiently using the bandwidth from main memory into the CPU cache, avoiding branch mispredictions and bubbles in the CPU instruction processing pipeline, and making use of single-instruction-multi-data (SIMD) instructions in modern CPUs [59, 60].

Besides reducing the volume of data that needs to be loaded from disk, column-oriented storage layouts are also good for making efficient use of CPU cycles. For example, the query engine can take a chunk of compressed column data that fits comfortably in the CPU’s L1 cache and iterate through it in a tight loop (that is, with no function calls). A CPU can execute such a loop much faster than code that requires a lot of function calls and conditions for each record that is processed. Column compression allows more rows from a column to fit in the same amount of L1 cache. Operators, such as the bitwise AND and OR described previously, can be designed to operate on such chunks of compressed column data directly. This technique is known as vectorized processing [58, 49].
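A rough way to see the word-at-a-time flavor of this in pure Python (this only imitates the effect; real engines rely on SIMD instructions and cache-friendly layouts): pack each bitmap into 64-bit integers, so that a single bitwise operation covers 64 rows at once instead of one branch per row.

```python
def pack_bits(bitmap, word_size=64):
    """Pack a row-per-bit bitmap into word_size-bit integers, so that one
    bitwise operation processes word_size rows at a time."""
    words = []
    for i in range(0, len(bitmap), word_size):
        word = 0
        for j, bit in enumerate(bitmap[i:i + word_size]):
            word |= bit << j
        words.append(word)
    return words

def packed_and(a, b):
    # One machine operation per 64-bit word instead of one test per row.
    return [x & y for x, y in zip(a, b)]

a = pack_bits([1, 0, 1, 1] * 40)   # 160 rows -> 3 words
b = pack_bits([1, 1, 0, 1] * 40)
result = packed_and(a, b)
```

ANDing the packed words is equivalent to ANDing the bitmaps row by row, but touches 3 integers rather than 160 bits.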

Sort Order in Column Storage

In a column store, it doesn’t necessarily matter in which order the rows are stored. It’s easiest to store them in the order in which they were inserted, since then inserting a new row just means appending to each of the column files. However, we can choose to impose an order, like we did with SSTables previously, and use that as an indexing mechanism.

Note that it wouldn’t make sense to sort each column independently, because then we would no longer know which items in the columns belong to the same row. We can only reconstruct a row because we know that the kth item in one column belongs to the same row as the kth item in another column.

Rather, the data needs to be sorted an entire row at a time, even though it is stored by column. The administrator of the database can choose the columns by which the table should be sorted, using their knowledge of common queries. For example, if queries often target date ranges, such as the last month, it might make sense to make date_key the first sort key. Then the query optimizer can scan only the rows from the last month, which will be much faster than scanning all rows.

A second column can determine the sort order of any rows that have the same value in the first column. For example, if date_key is the first sort key in Figure 3-10, it might make sense for product_sk to be the second sort key so that all sales for the same product on the same day are grouped together in storage. That will help queries that need to group or filter sales by product within a certain date range.

Another advantage of sorted order is that it can help with compression of columns. If the primary sort column does not have many distinct values, then after sorting, it will have long sequences where the same value is repeated many times in a row. A simple run-length encoding, like we used for the bitmaps in Figure 3-11, could compress that column down to a few kilobytes—even if the table has billions of rows.
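A small sketch of that effect (toy data; a real encoder works on far longer columns, where the difference is dramatic):

```python
from itertools import groupby

def rle_runs(column):
    """(value, run length) pairs; long runs of equal values compress well."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

unsorted_col = ["fruit", "candy", "fruit", "candy", "fruit", "candy"]
sorted_col = sorted(unsorted_col)

unsorted_runs = rle_runs(unsorted_col)  # six runs of length 1 -- no savings
sorted_runs = rle_runs(sorted_col)      # just two runs after sorting
```

Sorting turns six length-1 runs into two length-3 runs; with billions of rows and few distinct values, the sorted column collapses to a handful of runs.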

That compression effect is strongest on the first sort key. The second and third sort keys will be more jumbled up, and thus not have such long runs of repeated values. Columns further down the sorting priority appear in essentially random order, so they probably won’t compress as well. But having the first few columns sorted is still a win overall.

Several different sort orders

A clever extension of this idea was introduced in C-Store and adopted in the commercial data warehouse Vertica [61, 62]. Different queries benefit from different sort orders, so why not store the same data sorted in several different ways? Data needs to be replicated to multiple machines anyway, so that you don’t lose data if one machine fails. You might as well store that redundant data sorted in different ways so that when you’re processing a query, you can use the version that best fits the query pattern.

Having multiple sort orders in a column-oriented store is a bit similar to having multiple secondary indexes in a row-oriented store. But the big difference is that the row-oriented store keeps every row in one place (in the heap file or a clustered index), and secondary indexes just contain pointers to the matching rows. In a column store, there normally aren’t any pointers to data elsewhere, only columns containing values.

Writing to Column-Oriented Storage

These optimizations make sense in data warehouses, because most of the load consists of large read-only queries run by analysts. Column-oriented storage, compression, and sorting all help to make those read queries faster. However, they have the downside of making writes more difficult.

An update-in-place approach, like B-trees use, is not possible with compressed columns. If you wanted to insert a row in the middle of a sorted table, you would most likely have to rewrite all the column files. As rows are identified by their position within a column, the insertion has to update all columns consistently.

Fortunately, we have already seen a good solution earlier in this chapter: LSM-trees. All writes first go to an in-memory store, where they are added to a sorted structure and prepared for writing to disk. It doesn’t matter whether the in-memory store is row-oriented or column-oriented. When enough writes have accumulated, they are merged with the column files on disk and written to new files in bulk. This is essentially what Vertica does [62].
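This write path can be sketched as follows (a hypothetical miniature, not Vertica's actual design — `TinyColumnStore`, its methods, and its parameters are all invented for the example):

```python
class TinyColumnStore:
    """Invented sketch of the LSM-style write path: recent writes live in an
    in-memory structure; once enough accumulate, they are merged with the
    on-disk column files, which are rewritten in bulk."""

    def __init__(self, column_names, flush_threshold=2):
        self.column_names = column_names
        self.flush_threshold = flush_threshold
        self.memtable = {}                               # key -> row (recent writes)
        self.disk_keys = []                              # sorted row keys on "disk"
        self.disk_cols = {c: [] for c in column_names}   # one "file" per column

    def insert(self, key, row):
        self.memtable[key] = row
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Merge buffered writes into the column files and rewrite them."""
        merged = {k: self._disk_row(i) for i, k in enumerate(self.disk_keys)}
        merged.update(self.memtable)
        self.memtable = {}
        self.disk_keys = sorted(merged)
        for c in self.column_names:
            self.disk_cols[c] = [merged[k][c] for k in self.disk_keys]

    def _disk_row(self, i):
        return {c: self.disk_cols[c][i] for c in self.column_names}

    def read(self, key):
        # A query combines the on-disk columns with recent writes in memory.
        if key in self.memtable:
            return self.memtable[key]
        if key in self.disk_keys:
            return self._disk_row(self.disk_keys.index(key))
        return None


store = TinyColumnStore(["product_sk", "quantity"], flush_threshold=2)
store.insert(1, {"product_sk": 69, "quantity": 1})
store.insert(2, {"product_sk": 74, "quantity": 3})   # triggers a flush to "disk"
store.insert(3, {"product_sk": 31, "quantity": 2})   # still only in memory
```

Note that `read` sees row 3 even though it has not been flushed — the combination of memtable and column files is hidden from the caller, mirroring how the query optimizer hides it from analysts.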

Queries need to examine both the column data on disk and the recent writes in memory, and combine the two. However, the query optimizer hides this distinction from the user. From an analyst’s point of view, data that has been modified with inserts, updates, or deletes is immediately reflected in subsequent queries.

Aggregation: Data Cubes and Materialized Views

Not every data warehouse is necessarily a column store: traditional row-oriented databases and a few other architectures are also used. However, columnar storage can be significantly faster for ad hoc analytical queries, so it is rapidly gaining popularity [51, 63].

Another aspect of data warehouses that is worth mentioning briefly is materialized aggregates. As discussed earlier, data warehouse queries often involve an aggregate function, such as COUNT, SUM, AVG, MIN, or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that queries use most often?

One way of creating such a cache is a materialized view. In a relational data model, it is often defined like a standard (virtual) view: a table-like object whose contents are the results of some query. The difference is that a materialized view is an actual copy of the query results, written to disk, whereas a virtual view is just a shortcut for writing queries. When you read from a virtual view, the SQL engine expands it into the view’s underlying query on the fly and then processes the expanded query.

When the underlying data changes, a materialized view needs to be updated, because it is a denormalized copy of the data. The database can do that automatically, but such updates make writes more expensive, which is why materialized views are not often used in OLTP databases. In read-heavy data warehouses they can make more sense (whether or not they actually improve read performance depends on the individual case).
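A toy model of that trade-off (the class and its names are invented for illustration): every write does a little extra work to keep a per-group SUM current, so reads never have to rescan the raw rows.

```python
class MaterializedSum:
    """Invented toy: a materialized view holding SUM(quantity) per product,
    kept up to date as rows are written to the underlying table."""

    def __init__(self):
        self.rows = []              # the underlying "raw" fact rows
        self.sum_by_product = {}    # the materialized aggregate (the cache)

    def insert(self, product_sk, quantity):
        self.rows.append((product_sk, quantity))
        # The extra write-time cost that keeps the view consistent:
        self.sum_by_product[product_sk] = (
            self.sum_by_product.get(product_sk, 0) + quantity
        )

    def total(self, product_sk):
        # Reads use the precomputed copy instead of scanning self.rows.
        return self.sum_by_product.get(product_sk, 0)


view = MaterializedSum()
view.insert(69, 1)
view.insert(69, 2)
view.insert(31, 5)
```

In a write-heavy OLTP workload the per-insert bookkeeping would dominate; in a read-heavy warehouse it is amortized over many cheap reads.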

A common special case of a materialized view is known as a data cube or OLAP cube [64]. It is a grid of aggregates grouped by different dimensions. Figure 3-12 shows an example.

Figure 3-12. Two dimensions of a data cube, aggregating data by summing.

Imagine for now that each fact has foreign keys to only two dimension tables—in Figure 3-12, these are date and product. You can now draw a two-dimensional table, with dates along one axis and products along the other. Each cell contains the aggregate (e.g., SUM) of an attribute (e.g., net_price) of all facts with that date-product combination. Then you can apply the same aggregate along each row or column and get a summary that has been reduced by one dimension (the sales by product regardless of date, or the sales by date regardless of product).
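Sketched in code (made-up facts, summing a hypothetical net_price attribute):

```python
from collections import defaultdict

# Each fact: (date_key, product_sk, net_price) -- invented sample data.
facts = [
    (140101, 31, 2.49),
    (140101, 69, 3.99),
    (140102, 31, 2.49),
    (140102, 31, 2.49),
]

# The two-dimensional cube: SUM(net_price) per (date, product) cell.
cube = defaultdict(float)
for date_key, product_sk, net_price in facts:
    cube[(date_key, product_sk)] += net_price

# Summarize along each dimension: totals per product regardless of date,
# and totals per date regardless of product.
by_product = defaultdict(float)
by_date = defaultdict(float)
for (date_key, product_sk), total in cube.items():
    by_product[product_sk] += total
    by_date[date_key] += total
```

The row and column totals (`by_date`, `by_product`) are exactly the one-dimension-reduced summaries described above.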

In general, facts often have more than two dimensions. In Figure 3-9 there are five dimensions: date, product, store, promotion, and customer. It’s a lot harder to imagine what a five-dimensional hypercube would look like, but the principle remains the same: each cell contains the sales for a particular date-product-store-promotion-customer combination. These values can then repeatedly be summarized along each of the dimensions.

The advantage of a materialized data cube is that certain queries become very fast because they have effectively been precomputed. For example, if you want to know the total sales per store yesterday, you just need to look at the totals along the appropriate dimension—no need to scan millions of rows.

The disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data. For example, there is no way of calculating which proportion of sales comes from items that cost more than $100, because the price isn’t one of the dimensions. Most data warehouses therefore try to keep as much raw data as possible, and use aggregates such as data cubes only as a performance boost for certain queries.

Summary

In this chapter we tried to get to the bottom of how databases handle storage and retrieval. What happens when you store data in a database, and what does the database do when you query for the data again later?

On a high level, we saw that storage engines fall into two broad categories: those optimized for transaction processing (OLTP), and those optimized for analytics (OLAP). There are big differences between the access patterns in those use cases:

  • OLTP systems are typically user-facing, which means that they may see a huge volume of requests. In order to handle the load, applications usually only touch a small number of records in each query. The application requests records using some kind of key, and the storage engine uses an index to find the data for the requested key. Disk seek time is often the bottleneck here.

  • Data warehouses and similar analytic systems are less well known, because they are primarily used by business analysts, not by end users. They handle a much lower volume of queries than OLTP systems, but each query is typically very demanding, requiring many millions of records to be scanned in a short time. Disk bandwidth (not seek time) is often the bottleneck here, and column-oriented storage is an increasingly popular solution for this kind of workload.

On the OLTP side, we saw storage engines from two main schools of thought:

  • The log-structured school, which only permits appending to files and deleting obsolete files, but never updates a file that has been written. Bitcask, SSTables, LSM-trees, LevelDB, Cassandra, HBase, Lucene, and others belong to this group.

  • The update-in-place school, which treats the disk as a set of fixed-size pages that can be overwritten. B-trees are the biggest example of this philosophy, being used in all major relational databases and also many nonrelational ones.

Log-structured storage engines are a comparatively recent development. Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs.

Finishing off the OLTP side, we did a brief tour through some more complicated indexing structures, and databases that are optimized for keeping all data in memory.

We then took a detour from the internals of storage engines to look at the high-level architecture of a typical data warehouse. This background illustrated why analytic workloads are so different from OLTP: when your queries require sequentially scanning across a large number of rows, indexes are much less relevant. Instead it becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk. We discussed how column-oriented storage helps achieve this goal.

As an application developer, if you’re armed with this knowledge about the internals of storage engines, you are in a much better position to know which tool is best suited for your particular application. If you need to adjust a database’s tuning parameters, this understanding allows you to imagine what effect a higher or a lower value may have.

Although this chapter couldn’t make you an expert in tuning any one particular storage engine, it has hopefully equipped you with enough vocabulary and ideas that you can make sense of the documentation for the database of your choice.

Footnotes

i If all keys and values had a fixed size, you could use binary search on a segment file and avoid the in-memory index entirely. However, they are usually variable-length in practice, which makes it difficult to tell where one record ends and the next one starts if you don’t have an index.

ii Inserting a new key into a B-tree is reasonably intuitive, but deleting one (while keeping the tree balanced) is somewhat more involved [2].

iii This variant is sometimes known as a B+ tree, although the optimization is so common that it often isn’t distinguished from other B-tree variants.

iv The meaning of online in OLAP is unclear; it probably refers to the fact that queries are not just for predefined reports, but that analysts use the OLAP system interactively for explorative queries.

References

[1] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman: Data Structures and Algorithms. Addison-Wesley, 1983. ISBN: 978-0-201-00023-8

[2] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein: Introduction to Algorithms, 3rd edition. MIT Press, 2009. ISBN: 978-0-262-53305-8

[3] Justin Sheehy and David Smith: “Bitcask: A Log-Structured Hash Table for Fast Key/Value Data,” Basho Technologies, April 2010.

[4] Yinan Li, Bingsheng He, Robin Jun Yang, et al.: “Tree Indexing on Solid State Drives,” Proceedings of the VLDB Endowment, volume 3, number 1, pages 1195–1206, September 2010.

[5] Goetz Graefe: “Modern B-Tree Techniques,” Foundations and Trends in Databases, volume 3, number 4, pages 203–402, August 2011. doi:10.1561/1900000028

[6] Jeffrey Dean and Sanjay Ghemawat: “LevelDB Implementation Notes,” leveldb.googlecode.com.

[7] Dhruba Borthakur: “The History of RocksDB,” rocksdb.blogspot.com, November 24, 2013.

[8] Matteo Bertozzi: “Apache HBase I/O – HFile,” blog.cloudera.com, June 29, 2012.

[9] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: “Bigtable: A Distributed Storage System for Structured Data,” at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.

[10] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil: “The Log-Structured Merge-Tree (LSM-Tree),” Acta Informatica, volume 33, number 4, pages 351–385, June 1996. doi:10.1007/s002360050048

[11] Mendel Rosenblum and John K. Ousterhout: “The Design and Implementation of a Log-Structured File System,” ACM Transactions on Computer Systems, volume 10, number 1, pages 26–52, February 1992. doi:10.1145/146941.146943

[12] Adrien Grand: “What Is in a Lucene Index?,” at Lucene/Solr Revolution, November 14, 2013.

[13] Deepak Kandepet: “Hacking Lucene—The Index Format,” hackerlabs.org, October 1, 2011.

[14] Michael McCandless: “Visualizing Lucene’s Segment Merges,” blog.mikemccandless.com, February 11, 2011.

[15] Burton H. Bloom: “Space/Time Trade-offs in Hash Coding with Allowable Errors,” Communications of the ACM, volume 13, number 7, pages 422–426, July 1970. doi:10.1145/362686.362692

[16] “Operating Cassandra: Compaction,” Apache Cassandra Documentation v4.0, 2016.

[17] Rudolf Bayer and Edward M. McCreight: “Organization and Maintenance of Large Ordered Indices,” Boeing Scientific Research Laboratories, Mathematical and Information Sciences Laboratory, report no. 20, July 1970.

[18] Douglas Comer: “The Ubiquitous B-Tree,” ACM Computing Surveys, volume 11, number 2, pages 121–137, June 1979. doi:10.1145/356770.356776

[ 19 ] Emmanuel Goossaert:“ SSD 编码”,codecapsule.com,2014 年 2 月 12 日。

[19] Emmanuel Goossaert: “Coding for SSDs,” codecapsule.com, February 12, 2014.

[ 20 ] C. Mohan 和 Frank Levine:“ ARIES/IM:使用预写日志记录的高效高并发索引管理方法”,ACM 国际数据管理会议(SIGMOD),1992 年 6 月 。doi:10.1145/130283.130338

[20] C. Mohan and Frank Levine: “ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging,” at ACM International Conference on Management of Data (SIGMOD), June 1992. doi:10.1145/130283.130338

[ 21 ] Howard Chu:“ LDAP 以闪电般的速度”,Build Stuff '14,2014 年 11 月。

[21] Howard Chu: “LDAP at Lightning Speed,” at Build Stuff ’14, November 2014.

[ 22 ] Bradley C. Kuszmaul:“分形树与日志结构合并 (LSM) 树的比较”,tokutek.com,2014 年 4 月 22 日。

[22] Bradley C. Kuszmaul: “A Comparison of Fractal Trees to Log-Structured Merge (LSM) Trees,” tokutek.com, April 22, 2014.

[ 23 ]Manos Athanassoulis、Michael S. Kester、Lukas M. Maas 等人:“设计访问方法:RUM 猜想”,第 19 届国际扩展数据库技术会议(EDBT),2016 年 3 月 。doi:10.5441/002 /edbt.2016.42

[23] Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, et al.: “Designing Access Methods: The RUM Conjecture,” at 19th International Conference on Extending Database Technology (EDBT), March 2016. doi:10.5441/002/edbt.2016.42

[ 24 ] Peter Zaitsev:“ Innodb Double Write ”, percona.com,2006 年 8 月 4 日。

[24] Peter Zaitsev: “Innodb Double Write,” percona.com, August 4, 2006.

[ 25 ] Tomas Vondra:“论全页写入的影响”,blog.2ndquadrant.com,2016 年 11 月 23 日。

[25] Tomas Vondra: “On the Impact of Full-Page Writes,” blog.2ndquadrant.com, November 23, 2016.

[ 26 ]Mark Callaghan:“ LSM 与 B 树的优势”,smalldatum.blogspot.co.uk,2016 年 1 月 19 日。

[26] Mark Callaghan: “The Advantages of an LSM vs a B-Tree,” smalldatum.blogspot.co.uk, January 19, 2016.

[ 27 ]Mark Callaghan:“使用 RocksDB 在效率和性能之间进行选择”,Code Mesh,2016 年 11 月 4 日。

[27] Mark Callaghan: “Choosing Between Efficiency and Performance with RocksDB,” at Code Mesh, November 4, 2016.

[ 28 ] Michi Mutsuzaki:“ MySQL 与 LevelDB ”, github.com,2011 年 8 月。

[28] Michi Mutsuzaki: “MySQL vs. LevelDB,” github.com, August 2011.

[ 29 ]Benjamin Coverston、Jonathan Ellis 等人:“ CASSANDRA-1608:重新设计的压缩”issues.apache.org,2011 年 7 月。

[29] Benjamin Coverston, Jonathan Ellis, et al.: “CASSANDRA-1608: Redesigned Compaction, issues.apache.org, July 2011.

[ 30 ] Igor Canadi、Siying Dong 和 Mark Callaghan:“ RocksDB 调优指南”, github.com,2016 年。

[30] Igor Canadi, Siying Dong, and Mark Callaghan: “RocksDB Tuning Guide,” github.com, 2016.

[ 31 ] MySQL 5.7 参考手册。甲骨文,2014 年。

[31] MySQL 5.7 Reference Manual. Oracle, 2014.

[ 32 ] SQL Server 2012 在线书籍。微软,2012。

[32] Books Online for SQL Server 2012. Microsoft, 2012.

[ 33 ] Joe Webb:“使用覆盖索引提高查询性能”,simple-talk.com,2008 年 9 月 29 日。

[33] Joe Webb: “Using Covering Indexes to Improve Query Performance,” simple-talk.com, 29 September 2008.

[ 34 ] Frank Ramsak、Volker Markl、Robert Fenk 等人:“将 UB-Tree 集成到数据库系统内核中”,第 26 届国际超大型数据库会议(VLDB),2000 年 9 月。

[34] Frank Ramsak, Volker Markl, Robert Fenk, et al.: “Integrating the UB-Tree into a Database System Kernel,” at 26th International Conference on Very Large Data Bases (VLDB), September 2000.

[ 35 ] PostGIS 开发小组:“ PostGIS 2.1.2dev 手册”, postgis.net,2014 年。

[35] The PostGIS Development Group: “PostGIS 2.1.2dev Manual,” postgis.net, 2014.

[ 36 ] Robert Escriva、Bernard Wong 和 Emin Gün Sirer:“ HyperDex:分布式、可搜索的键值存储” , ACM SIGCOMM 会议,2012 年 8 月 。doi:10.1145/2377677.2377681

[36] Robert Escriva, Bernard Wong, and Emin Gün Sirer: “HyperDex: A Distributed, Searchable Key-Value Store,” at ACM SIGCOMM Conference, August 2012. doi:10.1145/2377677.2377681

[ 37 ] Michael McCandless:“ Lucene 的 FuzzyQuery 在 4.0 中快了 100 倍”,blog.mikemccandless.com,2011 年 3 月 24 日。

[37] Michael McCandless: “Lucene’s FuzzyQuery Is 100 Times Faster in 4.0,” blog.mikemccandless.com, March 24, 2011.

[ 38 ] Steffen Heinz、Justin Zobel 和 Hugh E. Williams:“ Burst Tries:一种快速、高效的字符串键数据结构”, ACM Transactions on Information Systems,第 20 卷,第 2 期,第 192-223 页,2002 年 4 月。 号码:10.1145/506309.506312

[38] Steffen Heinz, Justin Zobel, and Hugh E. Williams: “Burst Tries: A Fast, Efficient Data Structure for String Keys,” ACM Transactions on Information Systems, volume 20, number 2, pages 192–223, April 2002. doi:10.1145/506309.506312

[ 39 ] Klaus U. Schulz 和 Stoyan Mihov:“使用 Levenshtein Automata 进行快速字符串校正”, International Journal on Document Analysis and Recognition,第 5 卷,第 1 期,第 67-85 页,2002 年 11 月 。doi:10.1007/s10032-002- 0082-8

[39] Klaus U. Schulz and Stoyan Mihov: “Fast String Correction with Levenshtein Automata,” International Journal on Document Analysis and Recognition, volume 5, number 1, pages 67–85, November 2002. doi:10.1007/s10032-002-0082-8

[ 40 ] Christopher D. Manning、Prabhakar Raghavan 和 Hinrich Schütze: 信息检索简介。剑桥大学出版社,2008 年。ISBN:978-0-521-86571-5,可在nlp.stanford.edu/IR-book在线获取

[40] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at nlp.stanford.edu/IR-book

[ 41 ] Michael Stonebraker、Samuel Madden、Daniel J. Abadi 等人:“建筑时代的终结(是时候进行彻底重写了) ”, 第 33 届超大型数据库国际会议(VLDB),2007 年 9 月。

[41] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, et al.: “The End of an Architectural Era (It’s Time for a Complete Rewrite),” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.

[ 42 ]“ VoltDB 技术概述白皮书”,VoltDB,2014 年。

[42] “VoltDB Technical Overview White Paper,” VoltDB, 2014.

[ 43 ] Stephen M. Rumble、Ankita Kejriwal 和 John K. Ousterhout:“基于DRAM 存储的日志结构内存”,第 12 届 USENIX 文件和存储技术会议(FAST),2014 年 2 月。

[43] Stephen M. Rumble, Ankita Kejriwal, and John K. Ousterhout: “Log-Structured Memory for DRAM-Based Storage,” at 12th USENIX Conference on File and Storage Technologies (FAST), February 2014.

[ 44 ] Stavros Harizopoulos、Daniel J. Abadi、Samuel Madden 和 Michael Stonebraker:“ OLTP Through the Looking Glass,以及我们在那里发现的内容”,ACM 国际数据管理会议 (SIGMOD),2008 年 6 月 。doi:10.1145 /1376616.1376713

[44] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker: “OLTP Through the Looking Glass, and What We Found There,” at ACM International Conference on Management of Data (SIGMOD), June 2008. doi:10.1145/1376616.1376713

[ 45 ] Justin DeBrabant、Andrew Pavlo、Stephen Tu 等人:“反缓存:数据库管理系统架构的新方法”,VLDB Endowment 会议记录,第 6 卷,第 14 期,第 1942-1953 页,2013 年 9 月。

[45] Justin DeBrabant, Andrew Pavlo, Stephen Tu, et al.: “Anti-Caching: A New Approach to Database Management System Architecture,” Proceedings of the VLDB Endowment, volume 6, number 14, pages 1942–1953, September 2013.

[ 46 ] Joy Arulraj、Andrew Pavlo 和 Subramanya R. Dulloor:“让我们谈谈非易失性内存数据库系统的存储和恢复方法”,ACM 国际数据管理会议(SIGMOD),2015 年 6 月 。doi:10.1145 /2723372.2749441

[46] Joy Arulraj, Andrew Pavlo, and Subramanya R. Dulloor: “Let’s Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2749441

[ 47 ] Edgar F. Codd、SB Codd 和 CT Salley:“向用户分析师提供 OLAP:一项 IT 任务”,EF Codd Associates,1993 年。

[47] Edgar F. Codd, S. B. Codd, and C. T. Salley: “Providing OLAP to User-Analysts: An IT Mandate,” E. F. Codd Associates, 1993.

[ 48 ] Surajit Chaudhuri 和 Umeshwar Dayal:“数据仓库和 OLAP 技术概述”,ACM SIGMOD Record,第 26 卷,第 1 期,第 65-74 页,1997 年 3 月。doi:10.1145/248603.248616

[48] Surajit Chaudhuri and Umeshwar Dayal: “An Overview of Data Warehousing and OLAP Technology,” ACM SIGMOD Record, volume 26, number 1, pages 65–74, March 1997. doi:10.1145/248603.248616

[ 49 ] Per-Åke Larson、Cipri Clinciu、Campbell Fraser 等人:“ SQL Server 列存储的增强”,ACM 国际数据管理会议 (SIGMOD),2013 年 6 月。

[49] Per-Åke Larson, Cipri Clinciu, Campbell Fraser, et al.: “Enhancements to SQL Server Column Stores,” at ACM International Conference on Management of Data (SIGMOD), June 2013.

[ 50 ] Franz Färber、Norman May、Wolfgang Lehner 等人:“ SAP HANA 数据库 – 架构概述”, IEEE 数据工程公告,第 35 卷,第 1 期,第 28-33 页,2012 年 3 月。

[50] Franz Färber, Norman May, Wolfgang Lehner, et al.: “The SAP HANA Database – An Architecture Overview,” IEEE Data Engineering Bulletin, volume 35, number 1, pages 28–33, March 2012.

[ 51 ] Michael Stonebraker:“传统的 RDBMS 智慧(几乎肯定)都是错误的”,EPFL演讲,2013 年 5 月。

[51] Michael Stonebraker: “The Traditional RDBMS Wisdom Is (Almost Certainly) All Wrong,” presentation at EPFL, May 2013.

[ 52 ] Daniel J. Abadi:“对 SQL-on-Hadoop 解决方案进行分类”,hadapt.com,2013 年 10 月 2 日。

[52] Daniel J. Abadi: “Classifying the SQL-on-Hadoop Solutions,” hadapt.com, October 2, 2013.

[ 53 ] Marcel Kornacker、Alexander Behm、Victor Bittorf 等人:“ Impala:用于 Hadoop 的现代开源 SQL 引擎”,第七届创新数据系统研究双年会(CIDR),2015 年 1 月。

[53] Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “Impala: A Modern, Open-Source SQL Engine for Hadoop,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.

[ 54 ] Sergey Melnik、Andrey Gubarev、Jing Jing Long 等人:“ Dremel:Web 规模数据集的交互式分析”,第36 届超大型数据库国际会议(VLDB),第 330-339 页,2010 年 9 月。

[54] Sergey Melnik, Andrey Gubarev, Jing Jing Long, et al.: “Dremel: Interactive Analysis of Web-Scale Datasets,” at 36th International Conference on Very Large Data Bases (VLDB), pages 330–339, September 2010.

[ 55 ] Ralph Kimball 和 Margy Ross: 数据仓库工具包:维度建模权威指南,第三版。约翰·威利父子公司,2013 年 7 月。ISBN:978-1-118-53080-1

[55] Ralph Kimball and Margy Ross: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd edition. John Wiley & Sons, July 2013. ISBN: 978-1-118-53080-1

[ 56 ] Derrick Harris:“为什么 Apple、eBay 和 Walmart 拥有您所见过的最大的数据仓库”, gigaom.com,2013 年 3 月 27 日。

[56] Derrick Harris: “Why Apple, eBay, and Walmart Have Some of the Biggest Data Warehouses You’ve Ever Seen,” gigaom.com, March 27, 2013.

[ 57 ] Julien Le Dem:“ Dremel 用镶木地板让一切变得简单”, blog.twitter.com,2013 年 9 月 11 日。

[57] Julien Le Dem: “Dremel Made Simple with Parquet,” blog.twitter.com, September 11, 2013.

[ 58 ] Daniel J. Abadi、Peter Boncz、Stavros Harizopoulos 等人:“现代列式数据库系统的设计和实现”,数据库基础与趋势,第 5 卷,第 3 期,第 197-280 页,12 月2013.doi :10.1561/1900000024

[58] Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, et al.: “The Design and Implementation of Modern Column-Oriented Database Systems,” Foundations and Trends in Databases, volume 5, number 3, pages 197–280, December 2013. doi:10.1561/1900000024

[ 59 ]Peter Boncz、Marcin Zukowski 和 Niels Nes:“ MonetDB/X100:超管道查询执行”,第二届创新数据系统研究双年度会议(CIDR),2005 年 1 月。

[59] Peter Boncz, Marcin Zukowski, and Niels Nes: “MonetDB/X100: Hyper-Pipelining Query Execution,” at 2nd Biennial Conference on Innovative Data Systems Research (CIDR), January 2005.

[ 60 ] Jingren Zhou 和 Kenneth A. Ross:“使用 SIMD 指令实现数据库操作”,ACM 国际数据管理会议(SIGMOD),第 145–156 页,2002 年 6 月 。doi:10.1145/564691.564709

[60] Jingren Zhou and Kenneth A. Ross: “Implementing Database Operations Using SIMD Instructions,” at ACM International Conference on Management of Data (SIGMOD), pages 145–156, June 2002. doi:10.1145/564691.564709

[ 61 ] Michael Stonebraker、Daniel J. Abadi、Adam Batkin 等人:“ C-Store:面向列的 DBMS ”,第31 届超大型数据库国际会议(VLDB),第 553-564 页,2005 年 9 月。

[61] Michael Stonebraker, Daniel J. Abadi, Adam Batkin, et al.: “C-Store: A Column-oriented DBMS,” at 31st International Conference on Very Large Data Bases (VLDB), pages 553–564, September 2005.

[ 62 ] Andrew Lamb、Matt Fuller、Ramakrishna Varadarajan 等人:“ Vertica 分析数据库:7 年后的便利店” , VLDB 捐赠论文集,第 5 卷,第 12 期,第 1790-1801 页,2012 年 8 月。

[62] Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, et al.: “The Vertica Analytic Database: C-Store 7 Years Later,” Proceedings of the VLDB Endowment, volume 5, number 12, pages 1790–1801, August 2012.

[ 63 ] Julien Le Dem 和 Nong Li:“利用 Apache Parquet 2.0 进行高效数据存储分析” ,2014 年 6 月在圣何塞举行的Hadoop 峰会上。

[63] Julien Le Dem and Nong Li: “Efficient Data Storage for Analytics with Apache Parquet 2.0,” at Hadoop Summit, San Jose, June 2014.

[ 64 ] Jim Gray、Surajit Chaudhuri、Adam Bosworth 等人:“数据立方体:通用分组、交叉表和小计的关系聚合运算符”,数据挖掘和知识发现,第 1 卷,第 1 期,第 29–53 页,2007 年 3 月 。doi:10.1023/A:1009726021843

[64] Jim Gray, Surajit Chaudhuri, Adam Bosworth, et al.: “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals,” Data Mining and Knowledge Discovery, volume 1, number 1, pages 29–53, March 2007. doi:10.1023/A:1009726021843

Chapter 4. Encoding and Evolution

Everything changes and nothing stands still.

Heraclitus of Ephesus, as quoted by Plato in Cratylus (360 BCE)

Applications inevitably change over time. Features are added or modified as new products are launched, user requirements become better understood, or business circumstances change. In Chapter 1 we introduced the idea of evolvability: we should aim to build systems that make it easy to adapt to change (see “Evolvability: Making Change Easy”).

In most cases, a change to an application’s features also requires a change to data that it stores: perhaps a new field or record type needs to be captured, or perhaps existing data needs to be presented in a new way.

The data models we discussed in Chapter 2 have different ways of coping with such change. Relational databases generally assume that all data in the database conforms to one schema: although that schema can be changed (through schema migrations; i.e., ALTER statements), there is exactly one schema in force at any one point in time. By contrast, schema-on-read (“schemaless”) databases don’t enforce a schema, so the database can contain a mixture of older and newer data formats written at different times (see “Schema flexibility in the document model”).

When a data format or schema changes, a corresponding change to application code often needs to happen (for example, you add a new field to a record, and the application code starts reading and writing that field). However, in a large application, code changes often cannot happen instantaneously:

  • With server-side applications you may want to perform a rolling upgrade (also known as a staged rollout), deploying the new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all the nodes. This allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolvability.

  • With client-side applications you’re at the mercy of the user, who may not install the update for some time.

This means that old and new versions of the code, and old and new data formats, may potentially all coexist in the system at the same time. In order for the system to continue running smoothly, we need to maintain compatibility in both directions:

Backward compatibility

Newer code can read data that was written by older code.

Forward compatibility

Older code can read data that was written by newer code.

Backward compatibility is normally not hard to achieve: as author of the newer code, you know the format of data written by older code, and so you can explicitly handle it (if necessary by simply keeping the old code to read the old data). Forward compatibility can be trickier, because it requires older code to ignore additions made by a newer version of the code.
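
As a minimal illustration of what "ignoring additions" means in practice, here is a hypothetical sketch (not code from any particular system; the field names, including photoUrl, are invented): an older reader stays forward compatible by keeping only the fields it knows about and silently skipping the rest.

```python
# Hypothetical sketch: a "version 1" reader that tolerates fields added
# by a newer writer. All field names here are invented for illustration.
V1_FIELDS = {"userName", "favoriteNumber"}

def read_v1(record):
    # Keep only the fields this version of the code understands;
    # anything added by newer code is skipped rather than treated as an error.
    return {key: value for key, value in record.items() if key in V1_FIELDS}

# Data written by a newer version of the code, with an extra field:
newer_record = {"userName": "Martin", "favoriteNumber": 1337, "photoUrl": "..."}
old_view = read_v1(newer_record)
assert old_view == {"userName": "Martin", "favoriteNumber": 1337}
```

The same record written by old code (without photoUrl) passes through unchanged, which is the easy backward-compatible direction.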

In this chapter we will look at several formats for encoding data, including JSON, XML, Protocol Buffers, Thrift, and Avro. In particular, we will look at how they handle schema changes and how they support systems where old and new data and code need to coexist. We will then discuss how those formats are used for data storage and for communication: in web services, Representational State Transfer (REST), and remote procedure calls (RPC), as well as message-passing systems such as actors and message queues.

Formats for Encoding Data

Programs usually work with data in (at least) two different representations:

  1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These data structures are optimized for efficient access and manipulation by the CPU (typically using pointers).

  2. When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn’t make sense to any other process, this sequence-of-bytes representation looks quite different from the data structures that are normally used in memory.

Thus, we need some kind of translation between the two representations. The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling).

Terminology clash

Serialization is unfortunately also used in the context of transactions (see Chapter 7), with a completely different meaning. To avoid overloading the word we’ll stick with encoding in this book, even though serialization is perhaps a more common term.

As this is such a common problem, there are a myriad different libraries and encoding formats to choose from. Let’s do a brief overview.

Language-Specific Formats

Many programming languages come with built-in support for encoding in-memory objects into byte sequences. For example, Java has java.io.Serializable [1], Ruby has Marshal [2], Python has pickle [3], and so on. Many third-party libraries also exist, such as Kryo for Java [4].
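
For example, Python's built-in pickle module round-trips an in-memory object with one call in each direction (a minimal sketch; the record mirrors the example used later in this chapter):

```python
import pickle

record = {"userName": "Martin", "favoriteNumber": 1337}

blob = pickle.dumps(record)     # in-memory object -> byte sequence
restored = pickle.loads(blob)   # byte sequence -> in-memory object
assert restored == record
assert isinstance(blob, bytes)
```

The resulting byte sequence uses pickle's own framing and is meaningful only to a Python process, which is exactly the language lock-in problem described below.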

These encoding libraries are very convenient, because they allow in-memory objects to be saved and restored with minimal additional code. However, they also have a number of deep problems:

  • The encoding is often tied to a particular programming language, and reading the data in another language is very difficult. If you store or transmit data in such an encoding, you are committing yourself to your current programming language for potentially a very long time, and precluding integrating your systems with those of other organizations (which may use different languages).

  • In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes. This is frequently a source of security problems [5]: if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which in turn often allows them to do terrible things such as remotely executing arbitrary code [6, 7].

  • Versioning data is often an afterthought in these libraries: as they are intended for quick and easy encoding of data, they often neglect the inconvenient problems of forward and backward compatibility.

  • Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought. For example, Java’s built-in serialization is notorious for its bad performance and bloated encoding [8].

For these reasons it’s generally a bad idea to use your language’s built-in encoding for anything other than very transient purposes.

JSON, XML, and Binary Variants

Moving to standardized encodings that can be written and read by many programming languages, JSON and XML are the obvious contenders. They are widely known, widely supported, and almost as widely disliked. XML is often criticized for being too verbose and unnecessarily complicated [9]. JSON’s popularity is mainly due to its built-in support in web browsers (by virtue of being a subset of JavaScript) and simplicity relative to XML. CSV is another popular language-independent format, albeit less powerful.

JSON, XML, and CSV are textual formats, and thus somewhat human-readable (although the syntax is a popular topic of debate). Besides the superficial syntactic issues, they also have some subtle problems:

  • There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish between a number and a string that happens to consist of digits (except by referring to an external schema). JSON distinguishes strings and numbers, but it doesn’t distinguish integers and floating-point numbers, and it doesn’t specify a precision.

    This is a problem when dealing with large numbers; for example, integers greater than 2^53 cannot be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become inaccurate when parsed in a language that uses floating-point numbers (such as JavaScript). An example of numbers larger than 2^53 occurs on Twitter, which uses a 64-bit number to identify each tweet. The JSON returned by Twitter’s API includes tweet IDs twice, once as a JSON number and once as a decimal string, to work around the fact that the numbers are not correctly parsed by JavaScript applications [10].

  • JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings (sequences of bytes without a character encoding). Binary strings are a useful feature, so people get around this limitation by encoding the binary data as text using Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded. This works, but it’s somewhat hacky and increases the data size by 33%.

  • There is optional schema support for both XML [11] and JSON [12]. These schema languages are quite powerful, and thus quite complicated to learn and implement. Use of XML schemas is fairly widespread, but many JSON-based tools don’t bother using schemas. Since the correct interpretation of data (such as numbers and binary strings) depends on information in the schema, applications that don’t use XML/JSON schemas need to potentially hardcode the appropriate encoding/decoding logic instead.

  • CSV does not have any schema, so it is up to the application to define the meaning of each row and column. If an application change adds a new row or column, you have to handle that change manually. CSV is also a quite vague format (what happens if a value contains a comma or a newline character?). Although its escaping rules have been formally specified [13], not all parsers implement them correctly.
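
Two of these problems are easy to demonstrate in a few lines (a sketch that uses Python floats, which are IEEE 754 doubles, to stand in for JavaScript's number type):

```python
import base64

# Integer precision: 2**53 + 1 is not representable as an IEEE 754 double,
# which is why 64-bit tweet IDs are mangled when parsed as JSON numbers
# in JavaScript.
tweet_id = 2**53 + 1
assert float(tweet_id) == float(2**53)   # the +1 is silently rounded away

# Base64 workaround for binary strings: every 3 raw bytes become 4 text
# characters, a 33% size increase.
raw = bytes(300)
assert len(base64.b64encode(raw)) == 400
```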

Despite these flaws, JSON, XML, and CSV are good enough for many purposes. It’s likely that they will remain popular, especially as data interchange formats (i.e., for sending data from one organization to another). In these situations, as long as people agree on what the format is, it often doesn’t matter how pretty or efficient the format is. The difficulty of getting different organizations to agree on anything outweighs most other concerns.

Binary encoding

For data that is used only internally within your organization, there is less pressure to use a lowest-common-denominator encoding format. For example, you could choose a format that is more compact or faster to parse. For a small dataset, the gains are negligible, but once you get into the terabytes, the choice of data format can have a big impact.

JSON is less verbose than XML, but both still use a lot of space compared to binary formats. This observation led to the development of a profusion of binary encodings for JSON (MessagePack, BSON, BJSON, UBJSON, BISON, and Smile, to name a few) and for XML (WBXML and Fast Infoset, for example). These formats have been adopted in various niches, but none of them are as widely adopted as the textual versions of JSON and XML.

Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers, or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In particular, since they don’t prescribe a schema, they need to include all the object field names within the encoded data. That is, in a binary encoding of the JSON document in Example 4-1, they will need to include the strings userName, favoriteNumber, and interests somewhere.

Example 4-1. Example record which we will encode in several binary formats in this chapter
{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

Let’s look at an example of MessagePack, a binary encoding for JSON. Figure 4-1 shows the byte sequence that you get if you encode the JSON document in Example 4-1 with MessagePack [14]. The first few bytes are as follows:

  1. The first byte, 0x83, indicates that what follows is an object (top four bits = 0x80) with three fields (bottom four bits = 0x03). (In case you’re wondering what happens if an object has more than 15 fields, so that the number of fields doesn’t fit in four bits, it then gets a different type indicator, and the number of fields is encoded in two or four bytes.)

  2. The second byte, 0xa8, indicates that what follows is a string (top four bits = 0xa0) that is eight bytes long (bottom four bits = 0x08).

  3. The next eight bytes are the field name userName in ASCII. Since the length was indicated previously, there’s no need for any marker to tell us where the string ends (or any escaping).

  4. The next seven bytes encode the six-letter string value Martin with a prefix 0xa6, and so on.

The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the textual JSON encoding (with whitespace removed). All the binary encodings of JSON are similar in this regard. It’s not clear whether such a small space reduction (and perhaps a speedup in parsing) is worth the loss of human-readability.
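
The byte-by-byte walkthrough above can be reproduced with a toy encoder (an illustrative sketch, not the real MessagePack library, covering only the cases needed for the example record):

```python
def msgpack_encode(obj):
    # Toy encoder: handles only small maps, short strings, small arrays,
    # and integers up to 16 bits, which is all the example record needs.
    if isinstance(obj, dict):
        out = bytes([0x80 | len(obj)])            # fixmap: 0x80 | field count
        for key, value in obj.items():
            out += msgpack_encode(key) + msgpack_encode(value)
        return out
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return bytes([0xA0 | len(data)]) + data   # fixstr: 0xa0 | byte length
    if isinstance(obj, list):
        out = bytes([0x90 | len(obj)])            # fixarray: 0x90 | item count
        for item in obj:
            out += msgpack_encode(item)
        return out
    if isinstance(obj, int):
        if 0 <= obj <= 0x7F:
            return bytes([obj])                   # positive fixint
        return b"\xcd" + obj.to_bytes(2, "big")   # uint 16

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}
encoded = msgpack_encode(record)
assert encoded[0] == 0x83     # map with three fields, as described above
assert encoded[1] == 0xA8     # string of length 8 ("userName")
assert len(encoded) == 66     # matches the 66 bytes of the real encoding
```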

In the following sections we will see how we can do much better, and encode the same record in just 32 bytes.

Figure 4-1. Example record (Example 4-1) encoded using MessagePack.

Thrift and Protocol Buffers

Apache Thrift [15] and Protocol Buffers (protobuf) [16] are binary encoding libraries that are based on the same principle. Protocol Buffers was originally developed at Google, Thrift was originally developed at Facebook, and both were made open source in 2007–08 [17].

Both Thrift and Protocol Buffers require a schema for any data that is encoded. To encode the data in Example 4-1 in Thrift, you would describe the schema in the Thrift interface definition language (IDL) like this:

struct Person {
  1: required string       userName,
  2: optional i64          favoriteNumber,
  3: optional list<string> interests
}

The equivalent schema definition for Protocol Buffers looks very similar:

message Person {
    required string user_name       = 1;
    optional int64  favorite_number = 2;
    repeated string interests       = 3;
}

Thrift and Protocol Buffers each come with a code generation tool that takes a schema definition like the ones shown here, and produces classes that implement the schema in various programming languages [18]. Your application code can call this generated code to encode or decode records of the schema.

What does data encoded with this schema look like? Confusingly, Thrift has two different binary encoding formats, called BinaryProtocol and CompactProtocol, respectively. Let’s look at BinaryProtocol first. Encoding Example 4-1 in that format takes 59 bytes, as shown in Figure 4-2 [19].

图 4-2。使用 Thrift 的 BinaryProtocol 编码的示例记录。

与图 4-1 类似,每个字段都有一个类型注释(以指示它是字符串、整数还是列表等),并且在需要时还有长度指示(字符串的长度、列表中的项目数)。数据中出现的字符串(“Martin”、“daydreaming”、“hacking”)也被编码为 ASCII(或者更确切地说,UTF-8),与之前类似。

Similarly to Figure 4-1, each field has a type annotation (to indicate whether it is a string, integer, list, etc.) and, where required, a length indication (length of a string, number of items in a list). The strings that appear in the data (“Martin”, “daydreaming”, “hacking”) are also encoded as ASCII (or rather, UTF-8), similar to before.

与图 4-1 相比,最大的区别是没有字段名称(userName、favoriteNumber、interests)。相反,编码数据包含字段标签,它们是数字(1、2 和 3)。这些就是出现在模式定义中的数字。字段标签就像字段的别名:它们是一种简洁的方式来说明我们正在讨论哪个字段,而无需拼写出字段名称。

The big difference compared to Figure 4-1 is that there are no field names (userName, favoriteNumber, interests). Instead, the encoded data contains field tags, which are numbers (1, 2, and 3). Those are the numbers that appear in the schema definition. Field tags are like aliases for fields—they are a compact way of saying what field we’re talking about, without having to spell out the field name.

Thrift CompactProtocol 编码在语义上与 BinaryProtocol 等效,但如图4-3 所示,它将相同的信息打包到仅 34 个字节中。它通过将字段类型和标记号打包到单个字节中并使用可变长度整数来实现此目的。数字 1337 没有使用完整的八个字节,而是用两个字节进行编码,每个字节的最高位用于指示是否还有更多字节。这意味着 –64 到 63 之间的数字用一个字节编码,–8192 到 8191 之间的数字用两个字节编码,依此类推。数字越大,使用的字节越多。

The Thrift CompactProtocol encoding is semantically equivalent to BinaryProtocol, but as you can see in Figure 4-3, it packs the same information into only 34 bytes. It does this by packing the field type and tag number into a single byte, and by using variable-length integers. Rather than using a full eight bytes for the number 1337, it is encoded in two bytes, with the top bit of each byte used to indicate whether there are still more bytes to come. This means numbers between –64 and 63 are encoded in one byte, numbers between –8192 and 8191 are encoded in two bytes, etc. Bigger numbers use more bytes.
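The variable-length integer idea is easy to see in code. The following Python sketch of the zigzag-plus-varint scheme is an illustration of the principle described above, not Thrift's actual implementation:

```python
def zigzag(n: int) -> int:
    # Map signed to unsigned so small negatives stay small:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    return (n << 1) ^ (n >> 63)

def encode_varint(n: int) -> bytes:
    u = zigzag(n)
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)  # top bit set: more bytes follow
        else:
            out.append(b)
            return bytes(out)

print(len(encode_varint(63)))    # 1 byte
print(len(encode_varint(-64)))   # 1 byte
print(len(encode_varint(1337)))  # 2 bytes
print(len(encode_varint(8192)))  # 3 bytes
```

Running this confirms the ranges quoted above: –64 to 63 fit in one byte, –8192 to 8191 in two, and larger numbers spill into more bytes.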

图 4-3。使用 Thrift 的 CompactProtocol 编码的示例记录。

最后,Protocol Buffers(只有一种二进制编码格式)对相同的数据进行编码,如图4-4所示。它的位打包方式略有不同,但在其他方面与 Thrift 的 CompactProtocol 非常相似。Protocol Buffers 在 33 个字节中容纳相同的记录。

Finally, Protocol Buffers (which has only one binary encoding format) encodes the same data as shown in Figure 4-4. It does the bit packing slightly differently, but is otherwise very similar to Thrift’s CompactProtocol. Protocol Buffers fits the same record in 33 bytes.

图 4-4。使用协议缓冲区编码的示例记录。

需要注意的一个细节:在前面显示的模式中,每个字段都被标记为 required 或 optional,但这对字段的编码方式没有影响(二进制数据中没有任何内容表明字段是否是必需的)。区别仅在于:required 会启用一个运行时检查,如果未设置该字段,该检查就会失败,这对于捕获错误非常有用。

One detail to note: in the schemas shown earlier, each field was marked either required or optional, but this makes no difference to how the field is encoded (nothing in the binary data indicates whether a field was required). The difference is simply that required enables a runtime check that fails if the field is not set, which can be useful for catching bugs.

字段标签和模式演变

Field tags and schema evolution

我们之前说过,模式不可避免地需要随着时间的推移而改变。我们称之为模式演化。Thrift 和 Protocol Buffers 如何在保持向后和向前兼容性的同时处理模式更改?

We said previously that schemas inevitably need to change over time. We call this schema evolution. How do Thrift and Protocol Buffers handle schema changes while keeping backward and forward compatibility?

正如您从示例中看到的,编码记录只是其编码字段的串联。每个字段都由其标签号(示例模式中的数字 1、2、3)标识,并用数据类型(例如字符串或整数)进行注释。如果未设置字段值,则只需从编码记录中省略该字段。由此可以看出,字段标签对于编码数据的含义至关重要。您可以更改模式中字段的名称,因为编码数据从不引用字段名称;但您不能更改字段的标签,因为这会使所有现有编码数据无效。

As you can see from the examples, an encoded record is just the concatenation of its encoded fields. Each field is identified by its tag number (the numbers 1, 2, 3 in the sample schemas) and annotated with a datatype (e.g., string or integer). If a field value is not set, it is simply omitted from the encoded record. From this you can see that field tags are critical to the meaning of the encoded data. You can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a field’s tag, since that would make all existing encoded data invalid.

您可以向架构添加新字段,前提是为每个字段指定新的标签号。如果旧代码(不知道您添加的新标签号)尝试读取新代码写入的数据,包括带有它无法识别的标签号的新字段,它可以简单地忽略该字段。数据类型注释允许解析器确定需要跳过多少字节。这保持了向前兼容性:旧代码可以读取新代码写入的记录。

You can add new fields to the schema, provided that you give each field a new tag number. If old code (which doesn’t know about the new tag numbers you added) tries to read data written by new code, including a new field with a tag number it doesn’t recognize, it can simply ignore that field. The datatype annotation allows the parser to determine how many bytes it needs to skip. This maintains forward compatibility: old code can read records that were written by new code.
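To make the skipping concrete, here is a toy Python decoder. The wire format below is invented for illustration (it is not Thrift's or protobuf's real encoding), but it shows how the type annotation lets old code step over a field whose tag it does not recognize:

```python
import struct

T_I32, T_STR = 1, 2  # made-up type codes for this sketch

def encode_field(tag: int, ftype: int, value) -> bytes:
    if ftype == T_I32:
        return struct.pack(">BBi", tag, ftype, value)
    data = value.encode("utf-8")
    return struct.pack(">BBI", tag, ftype, len(data)) + data

def decode_known(buf: bytes, known_tags: set) -> dict:
    """Decode fields whose tags we know; skip the rest using the type info."""
    fields, i = {}, 0
    while i < len(buf):
        tag, ftype = buf[i], buf[i + 1]
        i += 2
        if ftype == T_I32:
            (value,) = struct.unpack_from(">i", buf, i)
            i += 4
        else:  # T_STR: 4-byte length prefix, then UTF-8 bytes
            (length,) = struct.unpack_from(">I", buf, i)
            value = buf[i + 4:i + 4 + length].decode("utf-8")
            i += 4 + length
        if tag in known_tags:  # old code simply ignores unrecognized tags
            fields[tag] = value
    return fields

# New code writes a field with tag 4 that old code has never heard of
new_record = encode_field(1, T_STR, "Martin") + encode_field(4, T_I32, 42)
print(decode_known(new_record, known_tags={1}))  # {1: 'Martin'}
```

The old reader still advances past tag 4's bytes correctly, because the type code tells it how many bytes to consume.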

向后兼容性怎么样?只要每个字段都有唯一的标签号,新代码总是可以读取旧数据,因为标签号仍然具有相同的含义。唯一的细节是,如果添加新字段,则不能将其设为必填字段。如果您要添加一个字段并将其设置为必填字段,那么如果新代码读取旧代码写入的数据,则该检查将会失败,因为旧代码不会写入您添加的新字段。因此,为了保持向后兼容性,在架构初始部署后添加的每个字段都必须是可选的或具有默认值。

What about backward compatibility? As long as each field has a unique tag number, new code can always read old data, because the tag numbers still have the same meaning. The only detail is that if you add a new field, you cannot make it required. If you were to add a field and make it required, that check would fail if new code read data written by old code, because the old code will not have written the new field that you added. Therefore, to maintain backward compatibility, every field you add after the initial deployment of the schema must be optional or have a default value.

删除字段就像添加字段一样,向后和向前兼容性问题相反。这意味着您只能删除可选字段(必填字段永远无法删除),并且您永远不能再次使用相同的标签编号(因为您可能仍然在包含旧标签编号的地方写入数据,并且该字段必须被新代码忽略)。

Removing a field is just like adding a field, with backward and forward compatibility concerns reversed. That means you can only remove a field that is optional (a required field can never be removed), and you can never use the same tag number again (because you may still have data written somewhere that includes the old tag number, and that field must be ignored by new code).

数据类型和模式演变

Datatypes and schema evolution

更改字段的数据类型怎么样?这也许是可能的(请查看文档以了解详细信息),但存在值丢失精度或被截断的风险。例如,假设您将 32 位整数更改为 64 位整数。新代码可以轻松读取旧代码写入的数据,因为解析器可以用零填充任何丢失的位。但是,如果旧代码读取新代码写入的数据,旧代码仍然使用 32 位变量来保存该值。如果解码后的 64 位值不适合 32 位,则会被截断。

What about changing the datatype of a field? That may be possible—check the documentation for details—but there is a risk that values will lose precision or get truncated. For example, say you change a 32-bit integer into a 64-bit integer. New code can easily read data written by old code, because the parser can fill in any missing bits with zeros. However, if old code reads data written by new code, the old code is still using a 32-bit variable to hold the value. If the decoded 64-bit value won’t fit in 32 bits, it will be truncated.
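The truncation hazard can be demonstrated in Python by simulating a 32-bit variable with ctypes; this illustrates the general problem, not any particular library's behavior:

```python
import ctypes

def read_as_i32(decoded_value: int) -> int:
    # What survives when the decoded value is stored in a 32-bit variable
    return ctypes.c_int32(decoded_value).value

print(read_as_i32(1337))           # fits in 32 bits: unchanged
print(read_as_i32(5_000_000_000))  # does not fit: silently truncated
```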

Protocol Buffers 的一个有趣的细节是它没有列表或数组数据类型,而是为字段提供了一个 repeated 标记(这是与 required 和 optional 并列的第三个选项)。正如您在图 4-4 中所看到的,repeated 字段的编码正如其字面含义:相同的字段标签只是在记录中出现多次。这样做的好处是可以将 optional(单值)字段更改为 repeated(多值)字段。读取旧数据的新代码会看到一个包含零个或一个元素的列表(取决于该字段是否存在);读取新数据的旧代码只能看到列表的最后一个元素。

A curious detail of Protocol Buffers is that it does not have a list or array datatype, but instead has a repeated marker for fields (which is a third option alongside required and optional). As you can see in Figure 4-4, the encoding of a repeated field is just what it says on the tin: the same field tag simply appears multiple times in the record. This has the nice effect that it’s okay to change an optional (single-valued) field into a repeated (multi-valued) field. New code reading old data sees a list with zero or one elements (depending on whether the field was present); old code reading new data sees only the last element of the list.
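A small Python sketch of this decoding rule, using invented (tag, value) pairs in place of real protobuf wire data: an "optional" reader keeps the last occurrence of a tag, while a "repeated" reader collects all of them.

```python
# Invented (tag, value) pairs stand in for decoded wire data
encoded = [(3, "daydreaming"), (3, "hacking")]

def read_optional(pairs, tag):
    value = None
    for t, v in pairs:
        if t == tag:
            value = v  # later occurrences overwrite: the last one wins
    return value

def read_repeated(pairs, tag):
    return [v for t, v in pairs if t == tag]  # collect every occurrence

print(read_optional(encoded, 3))   # old code sees only 'hacking'
print(read_repeated(encoded, 3))   # new code sees the whole list
```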

Thrift 有一个专用的列表数据类型,它用列表元素的数据类型进行参数化。这不允许像 Protocol Buffers 那样从单值到多值的演变,但它具有支持嵌套列表的优点。

Thrift has a dedicated list datatype, which is parameterized with the datatype of the list elements. This does not allow the same evolution from single-valued to multi-valued as Protocol Buffers does, but it has the advantage of supporting nested lists.

Avro

Avro

Apache Avro [ 20 ] 是另一种二进制编码格式,有趣的是它与 Protocol Buffers 和 Thrift 不同。它于 2009 年作为 Hadoop 的子项目启动,因为 Thrift 不太适合 Hadoop 的用例 [ 21 ]。

Apache Avro [20] is another binary encoding format that is interestingly different from Protocol Buffers and Thrift. It was started in 2009 as a subproject of Hadoop, as a result of Thrift not being a good fit for Hadoop’s use cases [21].

Avro 还使用模式来指定正在编码的数据的结构。它有两种模式语言:一种(Avro IDL)供人工编辑,另一种(基于 JSON)更便于机器读取。

Avro also uses a schema to specify the structure of the data being encoded. It has two schema languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily machine-readable.

我们的示例架构是用 Avro IDL 编写的,可能如下所示:

Our example schema, written in Avro IDL, might look like this:

record Person {
    string               userName;
    union { null, long } favoriteNumber = null;
    array<string>        interests;
}

该架构的等效 JSON 表示如下:

The equivalent JSON representation of that schema is as follows:

{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",       "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
        {"name": "interests",      "type": {"type": "array", "items": "string"}}
    ]
}

首先,请注意模式中没有标签号。如果我们使用此模式对示例记录(示例 4-1)进行编码,则 Avro 二进制编码只有 32 个字节长——这是我们见过的所有编码中最紧凑的。编码字节序列的分解如图 4-5所示。

First of all, notice that there are no tag numbers in the schema. If we encode our example record (Example 4-1) using this schema, the Avro binary encoding is just 32 bytes long—the most compact of all the encodings we have seen. The breakdown of the encoded byte sequence is shown in Figure 4-5.

如果检查字节序列,您会发现没有任何内容可以识别字段或其数据类型。编码仅由连接在一起的值组成。字符串只是一个长度前缀,后跟 UTF-8 字节,但编码数据中没有任何内容告诉您它是一个字符串。它也可以是一个整数,或者完全是其他东西。整数使用可变长度编码进行编码(与 Thrift 的 CompactProtocol 相同)。

If you examine the byte sequence, you can see that there is nothing to identify fields or their datatypes. The encoding simply consists of values concatenated together. A string is just a length prefix followed by UTF-8 bytes, but there’s nothing in the encoded data that tells you that it is a string. It could just as well be an integer, or something else entirely. An integer is encoded using a variable-length encoding (the same as Thrift’s CompactProtocol).
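As a sanity check of these rules, the example record can be encoded by hand in Python, following the Avro specification's rules for strings (length-prefixed UTF-8), longs (zigzag varints), unions (branch index, then value), and arrays (item count, items, then a zero terminator). This is a hand-rolled sketch rather than the avro library, but it reproduces the 32-byte figure:

```python
def avro_long(n: int) -> bytes:
    """Zigzag-encode n, then emit as a variable-length integer."""
    u = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)  # high bit set: more bytes follow
        else:
            out.append(b)
            return bytes(out)

def avro_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return avro_long(len(data)) + data  # length prefix, then raw UTF-8 bytes

record = (
    avro_string("Martin")        # userName: no tag, no type info
    + avro_long(1)               # favoriteNumber union: branch 1 = long
    + avro_long(1337)            # favoriteNumber value
    + avro_long(2)               # interests array: 2 items in this block
    + avro_string("daydreaming")
    + avro_string("hacking")
    + avro_long(0)               # zero item count terminates the array
)
print(len(record))  # 32 bytes
```

Note how nothing in `record` identifies the fields; the reader must know the schema to make sense of the byte sequence.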

图 4-5。使用 Avro 编码的示例记录。

要解析二进制数据,您可以按照字段在架构中出现的顺序遍历字段,并使用架构告诉您每个字段的数据类型。这意味着只有读取数据的代码使用与写入数据的代码完全相同的模式,才能正确解码二进制数据。读取器和写入器之间的架构中的任何不匹配都意味着数据解码不正确。

To parse the binary data, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field. This means that the binary data can only be decoded correctly if the code reading the data is using the exact same schema as the code that wrote the data. Any mismatch in the schema between the reader and the writer would mean incorrectly decoded data.

那么,Avro 如何支持模式演化呢?

So, how does Avro support schema evolution?

写入者模式与读取者模式

The writer’s schema and the reader’s schema

使用 Avro,当应用程序想要对某些数据进行编码(将其写入文件或数据库、通过网络发送等)时,它会使用它所知道的任何版本的模式对数据进行编码,例如,该模式可以编译到应用程序中。这被称为写入者模式(writer’s schema)。

With Avro, when an application wants to encode some data (to write it to a file or database, to send it over the network, etc.), it encodes the data using whatever version of the schema it knows about—for example, that schema may be compiled into the application. This is known as the writer’s schema.

当应用程序想要解码某些数据(从文件或数据库读取数据,从网络接收数据等)时,它期望数据处于某种模式中,这称为读取器模式。这就是应用程序代码所依赖的模式——代码可能是在应用程序的构建过程中从该模式生成的。

When an application wants to decode some data (read it from a file or database, receive it from the network, etc.), it is expecting the data to be in some schema, which is known as the reader’s schema. That is the schema the application code is relying on—code may have been generated from that schema during the application’s build process.

Avro 的关键思想是,写入者模式和读取者模式不必相同,它们只需要兼容即可。解码(读取)数据时,Avro 库通过并排查看写入者模式和读取者模式,并将数据从写入者模式转换为读取者模式,来解决两者之间的差异。Avro 规范 [ 20 ] 准确定义了这种解析的工作方式,如图 4-6 所示。

The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same—they only need to be compatible. When data is decoded (read), the Avro library resolves the differences by looking at the writer’s schema and the reader’s schema side by side and translating the data from the writer’s schema into the reader’s schema. The Avro specification [20] defines exactly how this resolution works, and it is illustrated in Figure 4-6.

例如,如果写入者的模式和读取者的模式的字段顺序不同,则没有问题,因为模式解析按字段名称匹配字段。如果读取数据的代码遇到出现在写入者架构中但未出现在读取者架构中的字段,则会忽略该字段。如果读取数据的代码需要某个字段,但写入者的模式不包含该名称的字段,则将使用读取者模式中声明的默认值填充该字段。

For example, it’s no problem if the writer’s schema and the reader’s schema have their fields in a different order, because the schema resolution matches up the fields by field name. If the code reading the data encounters a field that appears in the writer’s schema but not in the reader’s schema, it is ignored. If the code reading the data expects some field, but the writer’s schema does not contain a field of that name, it is filled in with a default value declared in the reader’s schema.
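A minimal Python sketch of this name-based resolution (real Avro also performs type checking and promotion, which is omitted here):

```python
def resolve(writer_fields, reader_fields, record):
    """writer_fields: field names in the writer's order.
    reader_fields: {name: default} expected by the reader."""
    decoded = dict(zip(writer_fields, record))  # writer's schema drives parsing
    result = {}
    for name, default in reader_fields.items():
        if name in decoded:
            result[name] = decoded[name]   # matched by name, order irrelevant
        else:
            result[name] = default         # filled from the reader's default
    return result  # fields unknown to the reader are simply ignored

writer = ["userName", "favoriteNumber", "interests"]               # writer's schema
reader = {"userName": None, "interests": None, "photoURL": "n/a"}  # reader's schema
print(resolve(writer, reader, ["Martin", 1337, ["hacking"]]))
```

Here favoriteNumber (written but not expected) is dropped, and photoURL (expected but not written) is filled in from the reader's default.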

图 4-6。Avro 读取器解决了写入器模式和读取器模式之间的差异。

模式演化规则

Schema evolution rules

使用 Avro,前向兼容性意味着您可以将新版本的架构作为写入器,将旧版本的架构作为读取器。相反,向后兼容性意味着您可以将新版本的模式作为读取器,将旧版本作为写入器。

With Avro, forward compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader. Conversely, backward compatibility means that you can have a new version of the schema as reader and an old version as writer.

为了保持兼容性,您只能添加或删除具有默认值的字段。(我们的 Avro 模式中的字段 favoriteNumber 的默认值为 null。)例如,假设您添加一个具有默认值的字段,那么这个新字段存在于新模式中,但不存在于旧模式中。当使用新模式的读取者读取用旧模式写入的记录时,缺失的字段将被填充为默认值。

To maintain compatibility, you may only add or remove a field that has a default value. (The field favoriteNumber in our Avro schema has a default value of null.) For example, say you add a field with a default value, so this new field exists in the new schema but not the old one. When a reader using the new schema reads a record written with the old schema, the default value is filled in for the missing field.

如果您要添加一个没有默认值的字段,新的读取器将无法读取旧写入器写入的数据,因此您将破坏向后兼容性。如果您要删除没有默认值的字段,旧的读取器将无法读取新写入器写入的数据,因此您将破坏前向兼容性。

If you were to add a field that has no default value, new readers wouldn’t be able to read data written by old writers, so you would break backward compatibility. If you were to remove a field that has no default value, old readers wouldn’t be able to read data written by new writers, so you would break forward compatibility.

在某些编程语言中,null 是任何变量都可接受的默认值,但在 Avro 中并非如此:如果要允许字段为空,则必须使用联合类型。例如,union { null, long, string } field; 表示 field 可以是数字、字符串或 null。只有当 null 是联合的分支之一时,才能将其用作默认值。这比默认情况下所有内容都可以为空要冗长一些,但它通过明确说明什么可以为空、什么不能为空来帮助防止错误 [ 22 ]。

In some programming languages, null is an acceptable default for any variable, but this is not the case in Avro: if you want to allow a field to be null, you have to use a union type. For example, union { null, long, string } field; indicates that field can be a number, or a string, or null. You can only use null as a default value if it is one of the branches of the union. This is a little more verbose than having everything nullable by default, but it helps prevent bugs by being explicit about what can and cannot be null [22].

因此,Avro 没有像 Protocol Buffers 和 Thrift 那样的 optional 和 required 标记(它有联合类型和默认值)。

Consequently, Avro doesn’t have optional and required markers in the same way as Protocol Buffers and Thrift do (it has union types and default values instead).

更改字段的数据类型是可能的,前提是 Avro 可以转换类型。更改字段的名称是可能的,但有点棘手:读取器的模式可以包含字段名称的别名,因此它可以将旧写入器的模式字段名称与别名进行匹配。这意味着更改字段名称是向后兼容的,但不是向前兼容的。同样,向联合类型添加分支是向后兼容的,但不向前兼容。

Changing the datatype of a field is possible, provided that Avro can convert the type. Changing the name of a field is possible but a little tricky: the reader’s schema can contain aliases for field names, so it can match an old writer’s schema field names against the aliases. This means that changing a field name is backward compatible but not forward compatible. Similarly, adding a branch to a union type is backward compatible but not forward compatible.

但什么是写入者模式?

But what is the writer’s schema?

到目前为止,我们忽略了一个重要的问题:读者如何知道作者对特定数据进行编码的模式?我们不能只在每条记录中包含整个模式,因为模式可能比编码数据大得多,从而使二进制编码节省的所有空间都变得徒劳。

There is an important question that we’ve glossed over so far: how does the reader know the writer’s schema with which a particular piece of data was encoded? We can’t just include the entire schema with every record, because the schema would likely be much bigger than the encoded data, making all the space savings from the binary encoding futile.

答案取决于 Avro 的使用环境。举几个例子:

The answer depends on the context in which Avro is being used. To give a few examples:

包含大量记录的大文件
Large file with lots of records

Avro 的常见用途(尤其是在 Hadoop 环境中)是存储包含数百万条记录的大型文件,所有记录都使用相同的模式进行编码。(我们将在第 10 章中讨论这种情况。)在这种情况下,该文件的编写者可以在文件的开头仅包含编写者的模式一次。Avro 指定一种文件格式(对象容器文件)来执行此操作。

A common use for Avro—especially in the context of Hadoop—is for storing a large file containing millions of records, all encoded with the same schema. (We will discuss this kind of situation in Chapter 10.) In this case, the writer of that file can just include the writer’s schema once at the beginning of the file. Avro specifies a file format (object container files) to do this.

具有单独写入记录的数据库
Database with individually written records

在数据库中,不同的记录可能会在不同的时间点使用不同的写入者模式写入,您不能假设所有记录都具有相同的模式。最简单的解决方案是在每个编码记录的开头包含一个版本号,并在数据库中保留模式版本列表。读取者可以获取一条记录,提取版本号,然后从数据库中获取该版本号对应的写入者模式。使用该写入者模式,它就可以解码记录的其余部分。(例如,Espresso [ 23 ] 就是这样工作的。)

In a database, different records may be written at different points in time using different writer’s schemas—you cannot assume that all the records will have the same schema. The simplest solution is to include a version number at the beginning of every encoded record, and to keep a list of schema versions in your database. A reader can fetch a record, extract the version number, and then fetch the writer’s schema for that version number from the database. Using that writer’s schema, it can decode the rest of the record. (Espresso [23] works this way, for example.)
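The version-number approach might be sketched like this in Python. The helper names and schema table are hypothetical, and the payload is JSON purely for brevity; a real system would use a binary encoding and store the schema versions in the database:

```python
import json
import struct

schema_versions = {
    1: ["userName"],
    2: ["userName", "favoriteNumber"],  # a later schema added a field
}

def encode_record(version: int, values: list) -> bytes:
    payload = json.dumps(values).encode("utf-8")
    return struct.pack(">I", version) + payload  # 4-byte version prefix

def decode_record(blob: bytes) -> dict:
    (version,) = struct.unpack_from(">I", blob)
    field_names = schema_versions[version]       # look up the writer's schema
    values = json.loads(blob[4:].decode("utf-8"))
    return dict(zip(field_names, values))

print(decode_record(encode_record(1, ["Martin"])))
print(decode_record(encode_record(2, ["Martin", 1337])))
```

Records written at different times with different schemas can coexist in the same table, because each one carries enough information to find its writer's schema.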

通过网络连接发送记录
Sending records over a network connection

当两个进程通过双向网络连接进行通信时,它们可以在建立连接时协商模式版本,然后在连接的整个生命周期内使用该模式。Avro RPC 协议(请参阅“通过服务的数据流:REST 和 RPC”)就是这样工作的。

When two processes are communicating over a bidirectional network connection, they can negotiate the schema version on connection setup and then use that schema for the lifetime of the connection. The Avro RPC protocol (see “Dataflow Through Services: REST and RPC”) works like this.

模式版本数据库在任何情况下都是有用的,因为它充当文档并让您有机会检查模式兼容性[ 24 ]。作为版本号,您可以使用简单的递增整数,也可以使用架构的哈希值。

A database of schema versions is a useful thing to have in any case, since it acts as documentation and gives you a chance to check schema compatibility [24]. As the version number, you could use a simple incrementing integer, or you could use a hash of the schema.
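As a sketch of the hash option: fingerprinting a canonical form of the schema yields a version identifier that is stable across cosmetic differences such as key ordering. (Avro itself specifies its own canonical schema form and fingerprint algorithm; this simplified version is only an illustration.)

```python
import hashlib
import json

def schema_fingerprint(schema: dict) -> str:
    # Canonicalize so that key order doesn't change the fingerprint
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

v1 = {"type": "record", "name": "Person",
      "fields": [{"name": "userName", "type": "string"}]}
same = {"name": "Person", "type": "record",
        "fields": [{"type": "string", "name": "userName"}]}
print(schema_fingerprint(v1) == schema_fingerprint(same))  # True
```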

动态生成的模式

Dynamically generated schemas

与 Protocol Buffers 和 Thrift 相比,Avro 方法的优点之一是模式不包含任何标签号。但这为什么很重要?在模式中保留几个数字有什么问题?

One advantage of Avro’s approach, compared to Protocol Buffers and Thrift, is that the schema doesn’t contain any tag numbers. But why is this important? What’s the problem with keeping a couple of numbers in the schema?

不同之处在于 Avro 对动态生成的模式更友好。例如,假设您有一个关系数据库,想要将其内容转储到文件中,并且想要使用二进制格式来避免上述文本格式(JSON、CSV、SQL)的问题。如果您使用 Avro,您可以相当轻松地从关系模式生成 Avro 模式(以我们之前看到的 JSON 表示形式),并使用该模式对数据库内容进行编码,将其全部转储到 Avro 对象容器文件 [25 ]。您为每个数据库表生成一个记录模式,并且每一列都成为该记录中的一个字段。数据库中的列名称映射到 Avro 中的字段名称。

The difference is that Avro is friendlier to dynamically generated schemas. For example, say you have a relational database whose contents you want to dump to a file, and you want to use a binary format to avoid the aforementioned problems with textual formats (JSON, CSV, SQL). If you use Avro, you can fairly easily generate an Avro schema (in the JSON representation we saw earlier) from the relational schema and encode the database contents using that schema, dumping it all to an Avro object container file [25]. You generate a record schema for each database table, and each column becomes a field in that record. The column name in the database maps to the field name in Avro.
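A sketch of such a schema generator in Python. The column-type mapping and the table description format are assumptions for illustration; nullable columns become union types with a null default, as in the earlier example:

```python
import json

# Hypothetical mapping from SQL column types to Avro types
SQL_TO_AVRO = {"varchar": "string", "bigint": "long", "integer": "int"}

def avro_schema_for_table(table_name, columns):
    """columns: list of (column_name, sql_type, nullable) tuples."""
    fields = []
    for name, sql_type, nullable in columns:
        avro_type = SQL_TO_AVRO[sql_type]
        if nullable:
            # nullable column -> union with null, defaulting to null
            fields.append({"name": name, "type": ["null", avro_type],
                           "default": None})
        else:
            fields.append({"name": name, "type": avro_type})
    return {"type": "record", "name": table_name, "fields": fields}

schema = avro_schema_for_table("person", [
    ("user_name", "varchar", False),
    ("favorite_number", "bigint", True),
])
print(json.dumps(schema, indent=2))
```

Rerunning the generator after a schema change simply produces a new writer's schema; name-based resolution then matches it against any older reader's schema.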

现在,如果数据库架构发生更改(例如,表添加了一列并删除了一列),您可以从更新的数据库架构生成新的 Avro 架构,并在新的 Avro 架构中导出数据。数据导出过程不需要关注架构更改,只需在每次运行时进行架构转换即可。任何读取新数据文件的人都会看到记录的字段已更改,但由于字段是通过名称标识的,因此更新后的写入者模式仍然可以与旧读取者的模式相匹配。

Now, if the database schema changes (for example, a table has one column added and one column removed), you can just generate a new Avro schema from the updated database schema and export data in the new Avro schema. The data export process does not need to pay any attention to the schema change—it can simply do the schema conversion every time it runs. Anyone who reads the new data files will see that the fields of the record have changed, but since the fields are identified by name, the updated writer’s schema can still be matched up with the old reader’s schema.

相比之下,如果您为此目的使用 Thrift 或 Protocol Buffers,则可能必须手动分配字段标签:每次数据库模式更改时,管理员都必须手动更新从数据库列名称到字段标签的映射。(也许可以自动执行此操作,但模式生成器必须非常小心,不要分配以前使用过的字段标签。)这种动态生成的模式根本不是 Thrift 或 Protocol Buffers 的设计目标,而它正是 Avro 的设计目标。

By contrast, if you were using Thrift or Protocol Buffers for this purpose, the field tags would likely have to be assigned by hand: every time the database schema changes, an administrator would have to manually update the mapping from database column names to field tags. (It might be possible to automate this, but the schema generator would have to be very careful to not assign previously used field tags.) This kind of dynamically generated schema simply wasn’t a design goal of Thrift or Protocol Buffers, whereas it was for Avro.

代码生成和动态类型语言

Code generation and dynamically typed languages

Thrift 和 Protocol Buffers 依赖于代码生成:定义模式后,您可以生成用您选择的编程语言实现该模式的代码。这在 Java、C++ 或 C# 等静态类型语言中非常有用,因为它允许将高效的内存结构用于解码数据,并且在编写访问数据结构的程序时允许在 IDE 中进行类型检查和自动完成。

Thrift and Protocol Buffers rely on code generation: after a schema has been defined, you can generate code that implements this schema in a programming language of your choice. This is useful in statically typed languages such as Java, C++, or C#, because it allows efficient in-memory structures to be used for decoded data, and it allows type checking and autocompletion in IDEs when writing programs that access the data structures.

在 JavaScript、Ruby 或 Python 等动态类型编程语言中,生成代码没有多大意义,因为没有编译时类型检查器可以满足。在这些语言中,代码生成通常不受欢迎,因为它们避免了显式编译步骤。此外,在动态生成模式(例如从数据库表生成的 Avro 模式)的情况下,代码生成是获取数据的不必要的障碍。

In dynamically typed programming languages such as JavaScript, Ruby, or Python, there is not much point in generating code, since there is no compile-time type checker to satisfy. Code generation is often frowned upon in these languages, since they otherwise avoid an explicit compilation step. Moreover, in the case of a dynamically generated schema (such as an Avro schema generated from a database table), code generation is an unnecessary obstacle to getting to the data.

Avro 为静态类型编程语言提供了可选的代码生成,但它也可以在不生成任何代码的情况下使用。如果您有一个对象容器文件(其中嵌入了编写器的架构),您可以简单地使用 Avro 库打开它,并以与查看 JSON 文件相同的方式查看数据。该文件是自描述的,因为它包含所有必要的元数据。

Avro provides optional code generation for statically typed programming languages, but it can be used just as well without any code generation. If you have an object container file (which embeds the writer’s schema), you can simply open it using the Avro library and look at the data in the same way as you could look at a JSON file. The file is self-describing since it includes all the necessary metadata.

此属性与动态类型数据处理语言(如 Apache Pig [ 26 ] )结合使用特别有用。在 Pig 中,您只需打开一些 Avro 文件,开始分析它们,并将派生数据集写入 Avro 格式的输出文件,甚至无需考虑模式。

This property is especially useful in conjunction with dynamically typed data processing languages like Apache Pig [26]. In Pig, you can just open some Avro files, start analyzing them, and write derived datasets to output files in Avro format without even thinking about schemas.

模式的优点

The Merits of Schemas

正如我们所看到的,Protocol Buffers、Thrift 和 Avro 都使用模式来描述二进制编码格式。它们的模式语言比 XML Schema 或 JSON Schema 简单得多;后两者支持详细得多的验证规则(例如,“该字段的字符串值必须与该正则表达式匹配”或“该字段的整数值必须在 0 到 100 之间”)。由于 Protocol Buffers、Thrift 和 Avro 实现和使用起来更简单,它们已经发展到支持相当广泛的编程语言。

As we saw, Protocol Buffers, Thrift, and Avro all use a schema to describe a binary encoding format. Their schema languages are much simpler than XML Schema or JSON Schema, which support much more detailed validation rules (e.g., “the string value of this field must match this regular expression” or “the integer value of this field must be between 0 and 100”). As Protocol Buffers, Thrift, and Avro are simpler to implement and simpler to use, they have grown to support a fairly wide range of programming languages.

这些编码所基于的想法绝不是新的。例如,它们与 ASN.1 有很多共同点,ASN.1 是一种于 1984 年首次标准化的模式定义语言 [ 27 ]。它被用来定义各种网络协议,并且它的二进制编码(DER)仍然被用来编码SSL证书(X.509),例如[ 28 ]。ASN.1 支持使用标签号进行模式演化,类似于 Protocol Buffers 和 Thrift [ 29 ]。然而,它也非常复杂且文档不完善,因此ASN.1 对于新应用程序来说可能不是一个好的选择。

The ideas on which these encodings are based are by no means new. For example, they have a lot in common with ASN.1, a schema definition language that was first standardized in 1984 [27]. It was used to define various network protocols, and its binary encoding (DER) is still used to encode SSL certificates (X.509), for example [28]. ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers and Thrift [29]. However, it’s also very complex and badly documented, so ASN.1 is probably not a good choice for new applications.

许多数据系统还为其数据实现某种专有的二进制编码。例如,大多数关系数据库都有一个网络协议,您可以通过该协议向数据库发送查询并获取响应。这些协议通常特定于特定数据库,并且数据库供应商提供将来自数据库的网络协议的响应解码为内存中数据结构的驱动程序(例如,使用ODBC或JDBC API)。

Many data systems also implement some kind of proprietary binary encoding for their data. For example, most relational databases have a network protocol over which you can send queries to the database and get back responses. Those protocols are generally specific to a particular database, and the database vendor provides a driver (e.g., using the ODBC or JDBC APIs) that decodes responses from the database’s network protocol into in-memory data structures.

因此,我们可以看到,虽然 JSON、XML 和 CSV 等文本数据格式很普遍,但基于模式的二进制编码也是一个可行的选择。它们有许多不错的特性:

So, we can see that although textual data formats such as JSON, XML, and CSV are widespread, binary encodings based on schemas are also a viable option. They have a number of nice properties:

  • 它们比各种“二进制 JSON”变体更加紧凑,因为它们可以省略编码数据中的字段名称。

  • They can be much more compact than the various “binary JSON” variants, since they can omit field names from the encoded data.

  • 架构是一种有价值的文档形式,并且由于解码需要架构,因此您可以确保它是最新的(而手动维护的文档可能很容易与实际情况背离)。

  • The schema is a valuable form of documentation, and because the schema is required for decoding, you can be sure that it is up to date (whereas manually maintained documentation may easily diverge from reality).

  • 保留架构数据库允许您在部署任何内容之前检查架构更改的向前和向后兼容性。

  • Keeping a database of schemas allows you to check forward and backward compatibility of schema changes, before anything is deployed.

  • 对于静态类型编程语言的用户来说,从模式生成代码的能力非常有用,因为它可以在编译时进行类型检查。

  • For users of statically typed programming languages, the ability to generate code from the schema is useful, since it enables type checking at compile time.

总之,模式演化提供了与无模式/读时模式(schema-on-read)JSON 数据库相同的灵活性(请参阅“文档模型中的模式灵活性”),同时还为您的数据提供了更好的保证和更好的工具支持。

In summary, schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON databases provide (see “Schema flexibility in the document model”), while also providing better guarantees about your data and better tooling.

数据流模式

Modes of Dataflow

在本章开头我们说过,每当你想将一些数据发送到另一个不共享内存的进程时——例如,每当你想通过网络发送数据或将其写入文件时——你需要将其编码为字节序列。然后我们讨论了用于执行此操作的各种不同的编码。

At the beginning of this chapter we said that whenever you want to send some data to another process with which you don’t share memory—for example, whenever you want to send data over the network or write it to a file—you need to encode it as a sequence of bytes. We then discussed a variety of different encodings for doing this.

我们讨论了向前和向后兼容性,这对于可演进性很重要(通过允许您独立升级系统的不同部分,而不必立即更改所有内容,从而使更改变得容易)。兼容性是对数据进行编码的一个进程与对数据进行解码的另一个进程之间的关系。

We talked about forward and backward compatibility, which are important for evolvability (making change easy by allowing you to upgrade different parts of your system independently, and not having to change everything at once). Compatibility is a relationship between one process that encodes the data, and another process that decodes it.

这是一个相当抽象的想法——数据可以通过多种方式从一个进程流向另一个进程。谁对数据进行编码,谁对其进行解码?在本章的其余部分中,我们将探讨数据在进程之间流动的一些最常见的方式:

That’s a fairly abstract idea—there are many ways data can flow from one process to another. Who encodes the data, and who decodes it? In the rest of this chapter we will explore some of the most common ways how data flows between processes:

通过数据库的数据流

Dataflow Through Databases

在数据库中,写入数据库的进程对数据进行编码,从数据库读取的进程对其进行解码。可能只有一个进程访问数据库,此时读取者只是同一进程的较晚版本:在这种情况下,您可以认为在数据库中存储某些内容,就像向未来的自己发送消息。

In a database, the process that writes to the database encodes the data, and the process that reads from the database decodes it. There may just be a single process accessing the database, in which case the reader is simply a later version of the same process—in that case you can think of storing something in the database as sending a message to your future self.

向后兼容显然是必要的;否则未来的你将无法解码你之前写的内容。

Backward compatibility is clearly necessary here; otherwise your future self won’t be able to decode what you previously wrote.

一般来说,多个不同的进程同时访问数据库是很常见的。这些进程可能是多个不同的应用程序或服务,或者它们可能只是同一服务的多个实例(为了可扩展性或容错而并行运行)。无论哪种方式,在应用程序发生变化的环境中,访问数据库的某些进程可能会运行较新的代码,而某些进程将运行较旧的代码 - 例如,因为当前正在滚动升级中部署新版本,因此一些实例已更新,而另一些实例尚未更新。

In general, it’s common for several different processes to be accessing a database at the same time. Those processes might be several different applications or services, or they may simply be several instances of the same service (running in parallel for scalability or fault tolerance). Either way, in an environment where the application is changing, it is likely that some processes accessing the database will be running newer code and some will be running older code—for example because a new version is currently being deployed in a rolling upgrade, so some instances have been updated while others haven’t yet.

这意味着数据库中的值可能由较新版本的代码写入,随后由仍在运行的较旧版本的代码读取。因此,数据库通常也需要向前兼容。

This means that a value in the database may be written by a newer version of the code, and subsequently read by an older version of the code that is still running. Thus, forward compatibility is also often required for databases.

然而,还有一个额外的障碍。假设您将一个字段添加到记录模式,并且较新的代码将该新字段的值写入数据库。随后,旧版本的代码(尚不知道新字段)读取记录,更新它,然后将其写回。在这种情况下,理想的行为通常是旧代码保持新字段完整,即使它无法被解释。

However, there is an additional snag. Say you add a field to a record schema, and the newer code writes a value for that new field to the database. Subsequently, an older version of the code (which doesn’t yet know about the new field) reads the record, updates it, and writes it back. In this situation, the desirable behavior is usually for the old code to keep the new field intact, even though it couldn’t be interpreted.

前面讨论的编码格式支持这种未知字段的保存,但有时您需要在应用程序级别小心,如图4-7所示。例如,如果您将数据库值解码为应用程序中的模型对象,然后重新编码这些模型对象,则未知字段可能会在该转换过程中丢失。解决这个问题并不困难;你只需要意识到这一点。

The encoding formats discussed previously support such preservation of unknown fields, but sometimes you need to take care at an application level, as illustrated in Figure 4-7. For example, if you decode a database value into model objects in the application, and later reencode those model objects, the unknown field might be lost in that translation process. Solving this is not a hard problem; you just need to be aware of it.
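A Python sketch of the pitfall and one way to avoid it: the model class keeps fields it doesn't understand in an extras dict and writes them back on reencoding. (The photoURL field is a hypothetical addition written by a newer version of the code.)

```python
class PersonModel:
    """A model object that preserves fields this code version doesn't know."""
    KNOWN = {"userName", "favoriteNumber"}

    def __init__(self, record: dict):
        self.user_name = record.get("userName")
        self.favorite_number = record.get("favoriteNumber")
        # keep everything this version of the code doesn't understand
        self.extras = {k: v for k, v in record.items() if k not in self.KNOWN}

    def to_record(self) -> dict:
        record = {"userName": self.user_name,
                  "favoriteNumber": self.favorite_number}
        record.update(self.extras)  # write the unknown fields back intact
        return record

# "photoURL" stands in for a field written by newer code
row = {"userName": "Martin", "favoriteNumber": 1337, "photoURL": "photo.jpg"}
print(PersonModel(row).to_record())
```

Without the extras dict, a read-update-write cycle through this model class would silently drop photoURL, exactly the failure shown in Figure 4-7.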

图 4-7。当旧版本的应用程序更新先前由新版本的应用程序写入的数据时,如果不小心,数据可能会丢失。

Figure 4-7. When an older version of the application updates data previously written by a newer version of the application, data may be lost if you’re not careful.

不同时间写入不同值

Different values written at different times

数据库通常允许随时更新任何值。这意味着在单个数据库中,您可能拥有一些五毫秒前写入的值,以及一些五年前写入的值。

A database generally allows any value to be updated at any time. This means that within a single database you may have some values that were written five milliseconds ago, and some values that were written five years ago.

当您部署应用程序的新版本(至少是服务器端应用程序)时,您可以在几分钟内用新版本完全替换旧版本。数据库内容则不然:五年前的数据仍然会以原始编码形式存在,除非您从那时起明确重写了它。这种观察有时被总结为数据比代码更长寿

When you deploy a new version of your application (of a server-side application, at least), you may entirely replace the old version with the new version within a few minutes. The same is not true of database contents: the five-year-old data will still be there, in the original encoding, unless you have explicitly rewritten it since then. This observation is sometimes summed up as data outlives code.

将数据重写(迁移)到新模式当然是可能的,但在大型数据集上执行此操作的成本很高,因此大多数数据库会尽可能避免这样做。大多数关系数据库允许简单的模式更改,例如添加具有 null 默认值的新列,而无需重写现有数据。读取旧行时,数据库会为磁盘上编码数据中缺少的任何列填充 null。LinkedIn 的文档数据库 Espresso 使用 Avro 进行存储,使其能够使用 Avro 的模式演化规则 [ 23 ]。

Rewriting (migrating) data into a new schema is certainly possible, but it’s an expensive thing to do on a large dataset, so most databases avoid it if possible. Most relational databases allow simple schema changes, such as adding a new column with a null default value, without rewriting existing data.v When an old row is read, the database fills in nulls for any columns that are missing from the encoded data on disk. LinkedIn’s document database Espresso uses Avro for storage, allowing it to use Avro’s schema evolution rules [23].
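As a small illustration (using an in-memory SQLite database and a made-up `users` table), adding a column with a null default does not rewrite the rows that already exist; the database fills in nulls when old rows are read back:

```python
import sqlite3

# In-memory database standing in for a large production table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Alice')")

# Schema change: add a column with a null default. Like most relational
# databases, SQLite does not rewrite the existing rows for this change.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# A row written before the change reads back with NULL filled in.
row = conn.execute("SELECT id, name, email FROM users WHERE id = 1").fetchone()
assert row == (1, "Alice", None)

# Rows written after the change can populate the new column.
conn.execute("INSERT INTO users (id, name, email) VALUES (2, 'Bob', 'bob@example.com')")
```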

因此,模式演化允许整个数据库看起来好像是用单个模式编码的,即使底层存储可能包含用该模式的各种历史版本编码的记录。

Schema evolution thus allows the entire database to appear as if it was encoded with a single schema, even though the underlying storage may contain records encoded with various historical versions of the schema.

档案存储

Archival storage

也许您会不时拍摄数据库快照,例如用于备份目的或加载到数据仓库中(请参阅“数据仓库”)。在这种情况下,数据转储通常会使用最新的模式进行编码,即使源数据库中的原始编码包含来自不同时代的模式版本的混合。由于无论如何都要复制数据,因此您最好对数据副本进行一致的编码。

Perhaps you take a snapshot of your database from time to time, say for backup purposes or for loading into a data warehouse (see “Data Warehousing”). In this case, the data dump will typically be encoded using the latest schema, even if the original encoding in the source database contained a mixture of schema versions from different eras. Since you’re copying the data anyway, you might as well encode the copy of the data consistently.

由于数据转储是一次性写入的并且此后不可变,因此 Avro 对象容器文件等格式非常适合。这也是以分析友好的面向列的格式(例如 Parquet)对数据进行编码的好机会(请参阅“列压缩”)。

As the data dump is written in one go and is thereafter immutable, formats like Avro object container files are a good fit. This is also a good opportunity to encode the data in an analytics-friendly column-oriented format such as Parquet (see “Column Compression”).

第 10 章中,我们将更多地讨论在档案存储中使用数据。

In Chapter 10 we will talk more about using data in archival storage.

通过服务的数据流:REST 和 RPC

Dataflow Through Services: REST and RPC

当您的进程需要通过网络进行通信时,可以采用几种不同的方式来安排该通信。最常见的安排是有两个角色:客户端服务器。服务器通过网络公开 API,客户端可以连接到服务器以向该 API 发出请求。服务器公开的 API 称为服务

When you have processes that need to communicate over a network, there are a few different ways of arranging that communication. The most common arrangement is to have two roles: clients and servers. The servers expose an API over the network, and the clients can connect to the servers to make requests to that API. The API exposed by the server is known as a service.

Web 是这样工作的:客户端(Web 浏览器)向 Web 服务器发出请求,发出 GET 请求以下载 HTML、CSS、JavaScript、图像等,并发出 POST 请求向服务器提交数据。API 由一组标准化协议和数据格式(HTTP、URL、SSL/TLS、HTML 等)组成。由于网络浏览器、网络服务器和网站作者大多都同意这些标准,因此您可以使用任何网络浏览器访问任何网站(至少在理论上!)。

The web works this way: clients (web browsers) make requests to web servers, making GET requests to download HTML, CSS, JavaScript, images, etc., and making POST requests to submit data to the server. The API consists of a standardized set of protocols and data formats (HTTP, URLs, SSL/TLS, HTML, etc.). Because web browsers, web servers, and website authors mostly agree on these standards, you can use any web browser to access any website (at least in theory!).

Web 浏览器不是唯一的客户端类型。例如,在移动设备或台式计算机上运行的本机应用程序也可以向服务器发出网络请求,在 Web 浏览器内运行的客户端 JavaScript 应用程序可以使用 XMLHttpRequest 成为 HTTP 客户端(此技术称为 Ajax [ 30 ])。在这种情况下,服务器的响应通常不是用于向人类显示的 HTML,而是采用便于客户端应用程序代码进一步处理的编码数据(例如 JSON)。尽管 HTTP 可以用作传输协议,但在其之上实现的 API 是特定于应用程序的,并且客户端和服务器需要就该 API 的细节达成一致。

Web browsers are not the only type of client. For example, a native app running on a mobile device or a desktop computer can also make network requests to a server, and a client-side JavaScript application running inside a web browser can use XMLHttpRequest to become an HTTP client (this technique is known as Ajax [30]). In this case, the server’s response is typically not HTML for displaying to a human, but rather data in an encoding that is convenient for further processing by the client-side application code (such as JSON). Although HTTP may be used as the transport protocol, the API implemented on top is application-specific, and the client and server need to agree on the details of that API.

此外,服务器本身可以是另一个服务的客户端(例如,典型的 Web 应用程序服务器充当数据库的客户端)。这种方法通常用于按功能区域将大型应用程序分解为较小的服务,以便一个服务在需要另一个服务的某些功能或数据时向另一个服务发出请求。这种构建应用程序的方式传统上被称为面向服务的架构(SOA),最近经过改进并重新命名为微服务架构 [ 31 , 32 ]。

Moreover, a server can itself be a client to another service (for example, a typical web app server acts as client to a database). This approach is often used to decompose a large application into smaller services by area of functionality, such that one service makes a request to another when it requires some functionality or data from that other service. This way of building applications has traditionally been called a service-oriented architecture (SOA), more recently refined and rebranded as microservices architecture [31, 32].

在某些方面,服务类似于数据库:它们通常允许客户端提交和查询数据。然而,虽然数据库允许使用我们在 第 2 章中讨论的查询语言进行任意查询,但服务公开了特定于应用程序的 API,该 API 只允许由服务的业务逻辑(应用程序代码)预先确定的输入和输出 [33 ]。此限制提供了一定程度的封装:服务可以对客户端可以做什么和不能做什么施加细粒度的限制。

In some ways, services are similar to databases: they typically allow clients to submit and query data. However, while databases allow arbitrary queries using the query languages we discussed in Chapter 2, services expose an application-specific API that only allows inputs and outputs that are predetermined by the business logic (application code) of the service [33]. This restriction provides a degree of encapsulation: services can impose fine-grained restrictions on what clients can and cannot do.

面向服务/微服务架构的一个关键设计目标是通过使服务独立部署和可演化来使应用程序更容易更改和维护。例如,每项服务应该由一个团队拥有,并且该团队应该能够经常发布该服务的新版本,而无需与其他团队协调。换句话说,我们应该期望新旧版本的服务器和客户端同时运行,因此服务器和客户端使用的数据编码必须跨版本的服务 API 兼容——这正是我们一直在谈论的关于本章。

A key design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable. For example, each service should be owned by one team, and that team should be able to release new versions of the service frequently, without having to coordinate with other teams. In other words, we should expect old and new versions of servers and clients to be running at the same time, and so the data encoding used by servers and clients must be compatible across versions of the service API—precisely what we’ve been talking about in this chapter.

网页服务

Web services

当 HTTP 用作与服务通信的底层协议时,它被称为Web 服务。这可能有点用词不当,因为 Web 服务不仅在 Web 上使用,而且在多种不同的环境中使用。例如:

When HTTP is used as the underlying protocol for talking to the service, it is called a web service. This is perhaps a slight misnomer, because web services are not only used on the web, but in several different contexts. For example:

  1. 在用户设备上运行的客户端应用程序(例如,移动设备上的本机应用程序或使用 Ajax 的 JavaScript Web 应用程序)通过 HTTP 向服务发出请求。这些请求通常通过公共互联网进行。

  2. A client application running on a user’s device (e.g., a native app on a mobile device, or JavaScript web app using Ajax) making requests to a service over HTTP. These requests typically go over the public internet.

  3. 作为面向服务/微服务架构的一部分,一项服务向同一组织拥有的另一项服务发出请求,该服务通常位于同一数据中心内。(支持这种用例的软件有时称为中间件。)

  4. One service making requests to another service owned by the same organization, often located within the same datacenter, as part of a service-oriented/microservices architecture. (Software that supports this kind of use case is sometimes called middleware.)

  5. 一项服务通常通过互联网向不同组织拥有的服务发出请求。这用于不同组织的后端系统之间的数据交换。此类别包括在线服务提供的公共 API,例如信用卡处理系统或用于共享用户数据访问的 OAuth。

  6. One service making requests to a service owned by a different organization, usually via the internet. This is used for data exchange between different organizations’ backend systems. This category includes public APIs provided by online services, such as credit card processing systems, or OAuth for shared access to user data.

Web 服务有两种流行的方法:REST 和 SOAP。它们在哲学方面几乎截然相反,并且经常成为各自支持者激烈争论的话题。

There are two popular approaches to web services: REST and SOAP. They are almost diametrically opposed in terms of philosophy, and often the subject of heated debate among their respective proponents.vi

REST 不是一种协议,而是一种建立在 HTTP 原则之上的设计理念 [ 34 , 35 ]。它强调简单的数据格式,使用 URL 来标识资源,并使用 HTTP 功能进行缓存控制、身份验证和内容类型协商。与 SOAP 相比,REST 越来越受欢迎,至少在跨组织服务集成的背景下是如此 [ 36 ],并且通常与微服务相关联 [ 31 ]。按照REST原则设计的API称为RESTful

REST is not a protocol, but rather a design philosophy that builds upon the principles of HTTP [34, 35]. It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, and content type negotiation. REST has been gaining popularity compared to SOAP, at least in the context of cross-organizational service integration [36], and is often associated with microservices [31]. An API designed according to the principles of REST is called RESTful.

相比之下,SOAP 是一种基于 XML 的协议,用于发出网络 API 请求。vii 虽然它最常通过 HTTP 使用,但它的目标是独立于 HTTP 并避免使用大多数 HTTP 功能。相反,它附带了大量复杂的相关标准(Web 服务框架,称为WS-*),添加了各种功能 [ 37 ]。

By contrast, SOAP is an XML-based protocol for making network API requests.vii Although it is most commonly used over HTTP, it aims to be independent from HTTP and avoids using most HTTP features. Instead, it comes with a sprawling and complex multitude of related standards (the web service framework, known as WS-*) that add various features [37].

SOAP Web 服务的 API 使用基于 XML 的语言(称为 Web 服务描述语言或 WSDL)进行描述。WSDL 支持代码生成,以便客户端可以使用本地类和方法调用(编码为 XML 消息并由框架再次解码)来访问远程服务。这在静态类型编程语言中很有用,但在动态类型编程语言中则不太有用(请参阅“代码生成和动态类型语言”)。

The API of a SOAP web service is described using an XML-based language called the Web Services Description Language, or WSDL. WSDL enables code generation so that a client can access a remote service using local classes and method calls (which are encoded to XML messages and decoded again by the framework). This is useful in statically typed programming languages, but less so in dynamically typed ones (see “Code generation and dynamically typed languages”).

由于 WSDL 并非设计为人类可读的,并且 SOAP 消息通常过于复杂而无法手动构建,因此 SOAP 用户严重依赖工具支持、代码生成和 IDE [38 ]。对于使用 SOAP 供应商不支持的编程语言的用户来说,与 SOAP 服务集成是很困难的。

As WSDL is not designed to be human-readable, and as SOAP messages are often too complex to construct manually, users of SOAP rely heavily on tool support, code generation, and IDEs [38]. For users of programming languages that are not supported by SOAP vendors, integration with SOAP services is difficult.

尽管 SOAP 及其各种扩展表面上是标准化的,但不同供应商的实现之间的互操作性常常会导致问题 [ 39 ]。由于所有这些原因,尽管 SOAP 仍在许多大型企业中使用,但它在大多数小型公司中已经失宠。

Even though SOAP and its various extensions are ostensibly standardized, interoperability between different vendors’ implementations often causes problems [39]. For all of these reasons, although SOAP is still used in many large enterprises, it has fallen out of favor in most smaller companies.

RESTful API 倾向于采用更简单的方法,通常涉及更少的代码生成和自动化工具。OpenAPI(也称为 Swagger [ 40 ])等定义格式可用于描述 RESTful API 并生成文档。

RESTful APIs tend to favor simpler approaches, typically involving less code generation and automated tooling. A definition format such as OpenAPI, also known as Swagger [40], can be used to describe RESTful APIs and produce documentation.

远程过程调用 (RPC) 的问题

The problems with remote procedure calls (RPCs)

Web 服务只是通过网络发出 API 请求的一长串技术的最新体现,其中许多技术受到了广泛的宣传,但存在严重的问题。Enterprise JavaBeans (EJB) 和Java 的远程方法调用(RMI) 仅限于Java。分布式组件对象模型 (DCOM) 仅限于 Microsoft 平台。公共对象请求代理架构(CORBA)过于复杂,并且不提供向后或向前兼容性[ 41 ]。

Web services are merely the latest incarnation of a long line of technologies for making API requests over a network, many of which received a lot of hype but have serious problems. Enterprise JavaBeans (EJB) and Java’s Remote Method Invocation (RMI) are limited to Java. The Distributed Component Object Model (DCOM) is limited to Microsoft platforms. The Common Object Request Broker Architecture (CORBA) is excessively complex, and does not provide backward or forward compatibility [41].

所有这些都是基于远程过程调用(RPC) 的思想,该思想自 20 世纪 70 年代就已存在 [ 42 ]。RPC 模型尝试使对远程网络服务的请求看起来与在同一进程内调用编程语言中的函数或方法相同(此抽象称为位置透明性)。尽管 RPC 乍一看似乎很方便,但该方法存在根本缺陷 [ 43 , 44 ]。网络请求与本地函数调用有很大不同:

All of these are based on the idea of a remote procedure call (RPC), which has been around since the 1970s [42]. The RPC model tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process (this abstraction is called location transparency). Although RPC seems convenient at first, the approach is fundamentally flawed [43, 44]. A network request is very different from a local function call:

  • 本地函数调用是可预测的,成功或失败,仅取决于您控制的参数。网络请求是不可预测的:请求或响应可能会由于网络问题而丢失,或者远程计算机可能速度缓慢或不可用,而此类问题完全超出您的控制范围。网络问题很常见,因此您必须预见到它们,例如通过重试失败的请求。

  • A local function call is predictable and either succeeds or fails, depending only on parameters that are under your control. A network request is unpredictable: the request or response may be lost due to a network problem, or the remote machine may be slow or unavailable, and such problems are entirely outside of your control. Network problems are common, so you have to anticipate them, for example by retrying a failed request.

  • 本地函数调用要么返回结果,要么引发异常,要么永远不返回(因为它进入无限循环或进程崩溃)。网络请求还有另一种可能的结果:由于超时,它可能会返回而没有结果。在这种情况下,您根本不知道发生了什么:如果您没有收到远程服务的响应,您就无法知道请求是否通过。(我们将在第 8 章中更详细地讨论这个问题。)

  • A local function call either returns a result, or throws an exception, or never returns (because it goes into an infinite loop or the process crashes). A network request has another possible outcome: it may return without a result, due to a timeout. In that case, you simply don’t know what happened: if you don’t get a response from the remote service, you have no way of knowing whether the request got through or not. (We discuss this issue in more detail in Chapter 8.)

  • 如果您重试失败的网络请求,则可能会发生请求实际上已通过,而只有响应丢失的情况。 在这种情况下,重试将导致该操作被执行多次,除非您在协议中构建了重复数据删除(幂等)机制。本地函数调用不存在这个问题。(我们将在第 11 章中更详细地讨论幂等性。)

  • If you retry a failed network request, it could happen that the requests are actually getting through, and only the responses are getting lost. In that case, retrying will cause the action to be performed multiple times, unless you build a mechanism for deduplication (idempotence) into the protocol. Local function calls don’t have this problem. (We discuss idempotence in more detail in Chapter 11.)

  • 每次调用本地函数时,通常需要大约相同的时间来执行。网络请求比函数调用慢得多,并且其延迟也变化很大:在良好的情况下,它可能会在不到一毫秒的时间内完成,但当网络拥塞或远程服务过载时,可能需要很多秒才能完成完全相同的事情。

  • Every time you call a local function, it normally takes about the same time to execute. A network request is much slower than a function call, and its latency is also wildly variable: at good times it may complete in less than a millisecond, but when the network is congested or the remote service is overloaded it may take many seconds to do exactly the same thing.

  • 当您调用本地函数时,您可以有效地将其传递给本地内存中的对象的引用(指针)。当您发出网络请求时,所有这些参数都需要编码成可以通过网络发送的字节序列。如果参数是数字或字符串等基元,那还可以,但对于较大的对象很快就会出现问题。

  • When you call a local function, you can efficiently pass it references (pointers) to objects in local memory. When you make a network request, all those parameters need to be encoded into a sequence of bytes that can be sent over the network. That’s okay if the parameters are primitives like numbers or strings, but quickly becomes problematic with larger objects.

  • 客户端和服务可能用不同的编程语言实现,因此 RPC 框架必须将数据类型从一种语言转换为另一种语言。这最终可能会很糟糕,因为并非所有语言都具有相同的类型——例如,回想一下 JavaScript 中大于 2^53 的数字的问题(请参阅“JSON、XML 和二进制变体”)。这个问题在用单一语言编写的单个进程中不存在。

  • The client and the service may be implemented in different programming languages, so the RPC framework must translate datatypes from one language into another. This can end up ugly, since not all languages have the same types—recall JavaScript’s problems with numbers greater than 2^53, for example (see “JSON, XML, and Binary Variants”). This problem doesn’t exist in a single process written in a single language.
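The deduplication (idempotence) idea from the retry problem above can be sketched in a few lines. This is a toy in-process example, not a real RPC protocol: the `transfer` operation, its bookkeeping, and the request-ID scheme are all hypothetical.

```python
import uuid

# Server-side deduplication: each request carries a unique ID, and the
# server remembers which IDs it has already processed and their results.
processed = {}            # request_id -> cached result
balance = {"alice": 100}  # server-side state

def transfer(request_id, account, amount):
    if request_id in processed:        # duplicate delivery: return cached result
        return processed[request_id]
    balance[account] -= amount         # perform the action exactly once
    processed[request_id] = balance[account]
    return balance[account]

# A client whose first response was lost retries with the SAME request ID.
req = str(uuid.uuid4())
transfer(req, "alice", 30)             # first attempt: response lost in transit
result = transfer(req, "alice", 30)    # retry: deduplicated, not applied twice
assert result == 70 and balance["alice"] == 70
```

Without the `processed` table, the retry would debit the account a second time, which is exactly the hazard the bullet describes.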

所有这些因素意味着,尝试让远程服务看起来太像编程语言中的本地对象是没有意义的,因为这是根本不同的事情。REST 的部分吸引力在于它不会试图隐藏它是一种网络协议的事实(尽管这似乎并不能阻止人们在 REST 之上构建 RPC 库)。

All of these factors mean that there’s no point trying to make a remote service look too much like a local object in your programming language, because it’s a fundamentally different thing. Part of the appeal of REST is that it doesn’t try to hide the fact that it’s a network protocol (although this doesn’t seem to stop people from building RPC libraries on top of REST).

RPC 的当前方向

Current directions for RPC

尽管存在这些问题,RPC 并没有消失。各种 RPC 框架都建立在本章提到的所有编码之上:例如,Thrift 和 Avro 都包含 RPC 支持,gRPC 是使用 Protocol Buffers 的 RPC 实现,Finagle 也使用 Thrift,而 Rest.li 使用 JSON over HTTP。

Despite all these problems, RPC isn’t going away. Various RPC frameworks have been built on top of all the encodings mentioned in this chapter: for example, Thrift and Avro come with RPC support included, gRPC is an RPC implementation using Protocol Buffers, Finagle also uses Thrift, and Rest.li uses JSON over HTTP.

新一代 RPC 框架更加明确地表明远程请求与本地函数调用不同。例如,Finagle 和 Rest.li 使用 futures(promise)来封装可能失败的异步操作。Future 还简化了您需要并行向多个服务发出请求并组合它们的结果的情况 [ 45 ]。gRPC 支持流(stream),其中调用不仅包含一个请求和一个响应,还包含随时间变化的一系列请求和响应 [ 46 ]。

This new generation of RPC frameworks is more explicit about the fact that a remote request is different from a local function call. For example, Finagle and Rest.li use futures (promises) to encapsulate asynchronous actions that may fail. Futures also simplify situations where you need to make requests to multiple services in parallel, and combine their results [45]. gRPC supports streams, where a call consists of not just one request and one response, but a series of requests and responses over time [46].
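The parallel-requests pattern can be sketched with Python's standard-library futures rather than Finagle or Rest.li; the two service functions below are stand-ins for real network calls (which could additionally fail or time out).

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for two remote services; real calls would go over the network.
def fetch_profile(user_id):
    return {"user_id": user_id, "name": "Alice"}

def fetch_orders(user_id):
    return [{"order_id": 1}, {"order_id": 2}]

# Issue both requests in parallel, then combine the results once both
# futures have completed.
with ThreadPoolExecutor() as pool:
    profile_future = pool.submit(fetch_profile, 42)
    orders_future = pool.submit(fetch_orders, 42)
    page = {**profile_future.result(), "orders": orders_future.result()}
```

The calling code blocks only in `result()`, so the two requests overlap in time instead of being issued sequentially.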

其中一些框架还提供服务发现,即允许客户端找出可以在哪个 IP 地址和端口号上找到特定服务。我们将在“请求路由”中回到这个主题。

Some of these frameworks also provide service discovery—that is, allowing a client to find out at which IP address and port number it can find a particular service. We will return to this topic in “Request Routing”.

具有二进制编码格式的自定义 RPC 协议可以比 REST 上的 JSON 等通用协议实现更好的性能。然而,RESTful API 还有其他显著的优点:它有利于实验和调试(您可以简单地使用 Web 浏览器或命令行工具 curl 向它发出请求,无需任何代码生成或软件安装),它受到所有主流编程语言和平台的支持,并且有一个庞大的可用工具生态系统(服务器、缓存、负载均衡器、代理、防火墙、监控、调试工具、测试工具等)。

Custom RPC protocols with a binary encoding format can achieve better performance than something generic like JSON over REST. However, a RESTful API has other significant advantages: it is good for experimentation and debugging (you can simply make requests to it using a web browser or the command-line tool curl, without any code generation or software installation), it is supported by all mainstream programming languages and platforms, and there is a vast ecosystem of tools available (servers, caches, load balancers, proxies, firewalls, monitoring, debugging tools, testing tools, etc.).

由于这些原因,REST 似乎是公共 API 的主要风格。RPC 框架的主要关注点是同一组织(通常在同一数据中心内)拥有的服务之间的请求。

For these reasons, REST seems to be the predominant style for public APIs. The main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same datacenter.

RPC 的数据编码和演化

Data encoding and evolution for RPC

对于可演进性来说,RPC 客户端和服务器可以独立更改和部署非常重要。与数据流经数据库(如上一节所述)相比,对于数据流经服务的情况,我们可以做出一个简化的假设:可以合理地假设所有服务器将首先更新,然后是所有客户端。因此,您只需要请求的向后兼容性和响应的前向兼容性。

For evolvability, it is important that RPC clients and servers can be changed and deployed independently. Compared to data flowing through databases (as described in the last section), we can make a simplifying assumption in the case of dataflow through services: it is reasonable to assume that all the servers will be updated first, and all the clients second. Thus, you only need backward compatibility on requests, and forward compatibility on responses.

RPC 方案的向后和向前兼容性属性继承自它使用的任何编码:

The backward and forward compatibility properties of an RPC scheme are inherited from whatever encoding it uses:

  • Thrift、gRPC(Protocol Buffers)、Avro RPC都可以根据各自编码格式的兼容规则进行演进。

  • Thrift, gRPC (Protocol Buffers), and Avro RPC can be evolved according to the compatibility rules of the respective encoding format.

  • 在 SOAP 中,请求和响应是用 XML 模式指定的。这些可以进化,但存在一些微妙的陷阱[ 47 ]。

  • In SOAP, requests and responses are specified with XML schemas. These can be evolved, but there are some subtle pitfalls [47].

  • RESTful API 最常使用 JSON(没有正式指定的架构)进行响应,并使用 JSON 或 URI 编码/表单编码的请求参数进行请求。添加可选请求参数以及向响应对象添加新字段通常被视为保持兼容性的更改。

  • RESTful APIs most commonly use JSON (without a formally specified schema) for responses, and JSON or URI-encoded/form-encoded request parameters for requests. Adding optional request parameters and adding new fields to response objects are usually considered changes that maintain compatibility.
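These JSON compatibility conventions can be sketched as follows — a hypothetical v1/v2 API where all field and parameter names are made up. The old client stays forward compatible by ignoring unknown response fields; the server stays backward compatible by giving a new request parameter a default.

```python
import json

# Response from a v2 server, which added an optional "avatar" field.
response_v2 = '{"user_id": 42, "name": "Alice", "avatar": "http://example.com/a.png"}'

# An old (v1) client reads only the fields it knows about, rather than
# rejecting responses that contain anything extra.
def parse_user_v1(body):
    data = json.loads(body)
    return {"user_id": data["user_id"], "name": data["name"]}

user = parse_user_v1(response_v2)
assert user == {"user_id": 42, "name": "Alice"}

# On the request side, the server gives the new optional parameter a
# default, so old clients that omit it continue to work.
def handle_request(params):
    page_size = params.get("page_size", 10)  # parameter added in v2
    return page_size

assert handle_request({}) == 10 and handle_request({"page_size": 50}) == 50
```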

由于 RPC 通常用于跨组织边界的通信,服务提供者通常无法控制其客户端,也无法强制它们升级,这使得服务兼容性变得更加困难。因此,兼容性需要长期维持,甚至可能无限期维持。如果需要进行破坏兼容性的更改,服务提供商通常最终会同时维护多个版本的服务 API。

Service compatibility is made harder by the fact that RPC is often used for communication across organizational boundaries, so the provider of a service often has no control over its clients and cannot force them to upgrade. Thus, compatibility needs to be maintained for a long time, perhaps indefinitely. If a compatibility-breaking change is required, the service provider often ends up maintaining multiple versions of the service API side by side.

对于 API 版本控制应该如何工作(即客户端如何指示它想要使用哪个版本的 API [ 48 ])尚未达成一致。对于 RESTful API,常见的方法是在 URL 或 HTTP Accept 标头中使用版本号。对于使用 API 密钥来识别特定客户端的服务,另一种选择是将客户端请求的 API 版本存储在服务器上,并允许通过单独的管理界面更新此版本选择 [ 49 ]。

There is no agreement on how API versioning should work (i.e., how a client can indicate which version of the API it wants to use [48]). For RESTful APIs, common approaches are to use a version number in the URL or in the HTTP Accept header. For services that use API keys to identify a particular client, another option is to store a client’s requested API version on the server and to allow this version selection to be updated through a separate administrative interface [49].
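The Accept-header approach might be parsed like this — a sketch of one possible convention (the `version=` media-type parameter is an assumption here, not a standard that all APIs follow):

```python
# Extract a requested API version from an Accept header such as
# "application/vnd.example+json; version=2", falling back to a default.
def negotiate_version(headers, default=1):
    accept = headers.get("Accept", "")
    for part in accept.split(";"):
        part = part.strip()
        if part.startswith("version="):
            return int(part.split("=", 1)[1])
    return default

assert negotiate_version({"Accept": "application/vnd.example+json; version=2"}) == 2
assert negotiate_version({}) == 1  # old clients that send no version get v1
```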

消息传递数据流

Message-Passing Dataflow

我们一直在研究编码数据从一个进程流向另一个进程的不同方式。到目前为止,我们已经讨论了 REST 和 RPC(其中一个进程通过网络向另一进程发送请求,并期望尽快得到响应)和数据库(其中一个进程写入编码数据,而另一个进程有时会再次读取它)将来)。

We have been looking at the different ways encoded data flows from one process to another. So far, we’ve discussed REST and RPC (where one process sends a request over the network to another process and expects a response as quickly as possible), and databases (where one process writes encoded data, and another process reads it again sometime in the future).

在最后一节中,我们将简要介绍一下异步消息传递系统,它介于 RPC 和数据库之间。它们与 RPC 类似,客户端的请求(通常称为消息)以低延迟传递到另一个进程。它们与数据库类似,消息不是通过直接网络连接发送,而是通过称为消息代理(也称为消息队列或面向消息的中间件)的中介,该中介临时存储消息。

In this final section, we will briefly look at asynchronous message-passing systems, which are somewhere between RPC and databases. They are similar to RPC in that a client’s request (usually called a message) is delivered to another process with low latency. They are similar to databases in that the message is not sent via a direct network connection, but goes via an intermediary called a message broker (also called a message queue or message-oriented middleware), which stores the message temporarily.

与直接 RPC 相比,使用消息代理有几个优点:

Using a message broker has several advantages compared to direct RPC:

  • 如果接收者不可用或过载,它可以充当缓冲区,从而提高系统可靠性。

  • It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.

  • 它可以自动将消息重新传递给已崩溃的进程,从而防止消息丢失。

  • It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.

  • 它避免了发送者需要知道接收者的 IP 地址和端口号(这在虚拟机经常来来去去的云部署中特别有用)。

  • It avoids the sender needing to know the IP address and port number of the recipient (which is particularly useful in a cloud deployment where virtual machines often come and go).

  • 它允许将一封邮件发送给多个收件人。

  • It allows one message to be sent to several recipients.

  • 它在逻辑上将发送者与接收者解耦(发送者只发布消息,并不关心谁消费它们)。

  • It logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).

然而,与 RPC 的不同之处在于,消息传递通信通常是单向的:发送方通常不希望收到对其消息的回复。进程可以发送响应,但这通常在单独的通道上完成。这种通信模式是 异步的:发送者不会等待消息被传递,而是简单地发送它然后就忘记它。

However, a difference compared to RPC is that message-passing communication is usually one-way: a sender normally doesn’t expect to receive a reply to its messages. It is possible for a process to send a response, but this would usually be done on a separate channel. This communication pattern is asynchronous: the sender doesn’t wait for the message to be delivered, but simply sends it and then forgets about it.

消息代理

Message brokers

过去,消息代理的格局由 TIBCO、IBM WebSphere 和 webMethods 等公司的商业企业软件主导。最近,RabbitMQ、ActiveMQ、HornetQ、NATS 和 Apache Kafka 等开源实现变得流行。我们将在第 11 章中更详细地比较它们。

In the past, the landscape of message brokers was dominated by commercial enterprise software from companies such as TIBCO, IBM WebSphere, and webMethods. More recently, open source implementations such as RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka have become popular. We will compare them in more detail in Chapter 11.

详细的传递语义因实现和配置而异,但一般来说,消息代理的使用方式如下:一个进程将消息发送到一个命名的队列或主题,代理确保消息被传递给该队列或主题的一个或多个消费者或订阅者。同一主题可以有许多生产者和许多消费者。

The detailed delivery semantics vary by implementation and configuration, but in general, message brokers are used as follows: one process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic. There can be many producers and many consumers on the same topic.
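The publish/subscribe shape can be sketched with a toy in-memory broker. Unlike RabbitMQ or Kafka, this sketch has no buffering, redelivery, or network; it only shows that the sender addresses a topic, not the consumers, and that one message can reach several subscribers.

```python
from collections import defaultdict

# A toy broker: named topics, each with a list of subscriber callbacks.
class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received_a, received_b = [], []
broker.subscribe("user_events", received_a.append)
broker.subscribe("user_events", received_b.append)  # several recipients

# The sender just publishes bytes to a topic name; it does not know who
# (if anyone) consumes them, and any encoding format could be used.
broker.publish("user_events", b'{"event": "signup", "user_id": 42}')
```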

主题仅提供单向数据流。但是,消费者本身可以将消息发布到另一个主题(因此您可以将它们链接在一起,正如我们将在第11 章中看到的那样),或者发布到由原始消息的发送者使用的回复队列(允许请求/响应数据流) ,类似于RPC)。

A topic provides only one-way dataflow. However, a consumer may itself publish messages to another topic (so you can chain them together, as we shall see in Chapter 11), or to a reply queue that is consumed by the sender of the original message (allowing a request/response dataflow, similar to RPC).

消息代理通常不强制执行任何特定的数据模型 - 消息只是带有一些元数据的字节序列,因此您可以使用任何编码格式。如果编码向后和向前兼容,您就可以拥有最大的灵活性来独立更改发布者和消费者并以任何顺序部署它们。

Message brokers typically don’t enforce any particular data model—a message is just a sequence of bytes with some metadata, so you can use any encoding format. If the encoding is backward and forward compatible, you have the greatest flexibility to change publishers and consumers independently and deploy them in any order.

如果消费者将消息重新发布到另一个主题,您可能需要小心保留未知字段,以防止出现前面在数据库上下文中描述的问题(图 4-7)。

If a consumer republishes messages to another topic, you may need to be careful to preserve unknown fields, to prevent the issue described previously in the context of databases (Figure 4-7).

分布式参与者框架

Distributed actor frameworks

Actor模型是单个进程中并发的编程模型。逻辑不是直接处理线程(以及竞争条件、锁定和死锁的相关问题),而是封装在actor中。每个参与者通常代表一个客户端或实体,它可能具有一些本地状态(不与任何其他参与者共享),并且它通过发送和接收异步消息与其他参与者进行通信。无法保证消息传递:在某些错误情况下,消息将会丢失。由于每个 Actor 一次只处理一条消息,因此不需要担心线程,并且每个 Actor 都可以由框架独立调度。

The actor model is a programming model for concurrency in a single process. Rather than dealing directly with threads (and the associated problems of race conditions, locking, and deadlock), logic is encapsulated in actors. Each actor typically represents one client or entity, it may have some local state (which is not shared with any other actor), and it communicates with other actors by sending and receiving asynchronous messages. Message delivery is not guaranteed: in certain error scenarios, messages will be lost. Since each actor processes only one message at a time, it doesn’t need to worry about threads, and each actor can be scheduled independently by the framework.
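A minimal single-process sketch of the model (a made-up counter actor; real frameworks such as Akka add supervision, addressing, and distribution on top): each actor owns private state and a mailbox, and a loop processes one message at a time, so no locks are needed around the state.

```python
import queue
import threading

class CounterActor:
    def __init__(self):
        self.count = 0                  # local state, shared with no other actor
        self.mailbox = queue.Queue()    # incoming asynchronous messages
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        while True:
            message = self.mailbox.get()  # one message at a time
            if message == "stop":
                return
            self.count += message

    def send(self, message):
        # Asynchronous: enqueue the message and return immediately.
        self.mailbox.put(message)

    def stop(self):
        self.mailbox.put("stop")
        self._thread.join()

actor = CounterActor()
for _ in range(5):
    actor.send(1)
actor.stop()
assert actor.count == 5
```

Because messages are drained from the mailbox sequentially, the increments never race, even though the sender and the actor run on different threads.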

分布式参与者框架中,此编程模型用于跨多个节点扩展应用程序。无论发送者和接收者位于同一节点还是不同节点,都使用相同的消息传递机制。如果它们位于不同的节点上,则消息会被透明地编码为字节序列,通过网络发送,并在另一端进行解码。

In distributed actor frameworks, this programming model is used to scale an application across multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient are on the same node or different nodes. If they are on different nodes, the message is transparently encoded into a byte sequence, sent over the network, and decoded on the other side.

位置透明性在参与者模型中比在 RPC 中效果更好,因为参与者模型已经假设消息可能会丢失,即使在单个进程中也是如此。尽管网络上的延迟可能比同一进程内的延迟更高,但使用参与者模型时,本地和远程通信之间基本不匹配的情况较少。

Location transparency works better in the actor model than in RPC, because the actor model already assumes that messages may be lost, even within a single process. Although latency over the network is likely higher than within the same process, there is less of a fundamental mismatch between local and remote communication when using the actor model.

分布式参与者框架本质上将消息代理和参与者编程模型集成到单个框架中。但是,如果您想对基于 Actor 的应用程序执行滚动升级,您仍然需要担心向前和向后兼容性,因为消息可能从运行新版本的节点发送到运行旧版本的节点,反之亦然。

A distributed actor framework essentially integrates a message broker and the actor programming model into a single framework. However, if you want to perform rolling upgrades of your actor-based application, you still have to worry about forward and backward compatibility, as messages may be sent from a node running the new version to a node running the old version, and vice versa.

三种流行的分布式 Actor 框架按如下方式处理消息编码:

Three popular distributed actor frameworks handle message encoding as follows:

  • Akka默认使用 Java 内置的序列化,不提供向前或向后兼容性。但是,您可以用协议缓冲区之类的东西替换它,从而获得滚动升级的能力[ 50 ]。

  • Akka uses Java’s built-in serialization by default, which does not provide forward or backward compatibility. However, you can replace it with something like Protocol Buffers, and thus gain the ability to do rolling upgrades [50].

  • Orleans默认使用自定义数据编码格式,不支持滚动升级部署;要部署应用程序的新版本,您需要设置一个新集群,将流量从旧集群移动到新集群,然后关闭旧集群 [ 51 , 52 ]。与 Akka 一样,可以使用自定义序列化插件。

  • Orleans by default uses a custom data encoding format that does not support rolling upgrade deployments; to deploy a new version of your application, you need to set up a new cluster, move traffic from the old cluster to the new one, and shut down the old one [51, 52]. Like with Akka, custom serialization plug-ins can be used.

  • Erlang OTP中,更改记录模式非常困难(尽管系统具有许多为高可用性而设计的功能);滚动升级是可能的,但需要仔细规划[ 53 ]。一种实验性的新maps数据类型(类似于 JSON 的结构,2014 年在 Erlang R17 中引入)可能会让这在未来变得更容易 [ 54 ]。

  • In Erlang OTP it is surprisingly hard to make changes to record schemas (despite the system having many features designed for high availability); rolling upgrades are possible but need to be planned carefully [53]. An experimental new maps datatype (a JSON-like structure, introduced in Erlang R17 in 2014) may make this easier in the future [54].

总结

Summary

在本章中,我们研究了将数据结构转换为网络上的字节或磁盘上的字节的几种方法。我们看到这些编码的细节不仅影响其效率,而且更重要的是影响应用程序的架构以及部署它们的选项。

In this chapter we looked at several ways of turning data structures into bytes on the network or bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more importantly also the architecture of applications and your options for deploying them.

特别是,许多服务需要支持滚动升级,即服务的新版本一次逐步部署到几个节点,而不是同时部署到所有节点。滚动升级允许在不停机的情况下发布服务的新版本(从而鼓励频繁的小版本发布而不是罕见的大版本),并降低部署风险(允许在错误版本影响大量用户之前检测到并回滚)。这些属性对于可演化性以及对应用程序进行更改的便捷性非常有益。

In particular, many services need to support rolling upgrades, where a new version of a service is gradually deployed to a few nodes at a time, rather than deploying to all nodes simultaneously. Rolling upgrades allow new versions of a service to be released without downtime (thus encouraging frequent small releases over rare big releases) and make deployments less risky (allowing faulty releases to be detected and rolled back before they affect a large number of users). These properties are hugely beneficial for evolvability, the ease of making changes to an application.

在滚动升级期间,或由于各种其他原因,我们必须假设不同的节点正在运行应用程序代码的不同版本。因此,重要的是,系统中流动的所有数据都以提供向后兼容性(新代码可以读取旧数据)和前向兼容性(旧代码可以读取新数据)的方式进行编码。

During rolling upgrades, or for various other reasons, we must assume that different nodes are running different versions of our application’s code. Thus, it is important that all data flowing around the system is encoded in a way that provides backward compatibility (new code can read old data) and forward compatibility (old code can read new data).
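下面用一个极简的草图来说明这两种兼容性(假设性示例,使用 Python 和 JSON,并非书中代码):新代码为旧数据中缺失的字段填入默认值(向后兼容),旧代码则忽略新数据中它不认识的字段(向前兼容)。

As a minimal sketch of these two properties (a hypothetical example in Python with JSON, not code from the book): new code fills in a default for a field that is missing from old data (backward compatibility), while old code ignores fields it does not know about in new data (forward compatibility).

```python
import json

# v1 of the application knows only "name"; v2 adds an "email" field.
def decode_v2(data: bytes) -> dict:
    record = json.loads(data)
    # Backward compatibility: new code supplies a default for old data.
    record.setdefault("email", None)
    return record

def decode_v1(data: bytes) -> dict:
    record = json.loads(data)
    # Forward compatibility: old code simply ignores unknown fields.
    return {"name": record["name"]}

old_data = json.dumps({"name": "Alice"}).encode()  # written by v1
new_data = json.dumps({"name": "Bob", "email": "bob@example.com"}).encode()  # written by v2

print(decode_v2(old_data))  # new code reading old data
print(decode_v1(new_data))  # old code reading new data
```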

我们讨论了几种数据编码格式及其兼容性属性:

We discussed several data encoding formats and their compatibility properties:

  • 编程语言特定的编码仅限于单一编程语言,并且通常无法提供向前和向后兼容性。

  • Programming language–specific encodings are restricted to a single programming language and often fail to provide forward and backward compatibility.

  • JSON、XML 和 CSV 等文本格式很普遍,它们的兼容性取决于您如何使用它们。它们有可选的模式语言,这些语言有时很有帮助,有时却是一种障碍。这些格式对于数据类型有些模糊,因此您必须小心数字和二进制字符串等内容。

  • Textual formats like JSON, XML, and CSV are widespread, and their compatibility depends on how you use them. They have optional schema languages, which are sometimes helpful and sometimes a hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things like numbers and binary strings.

  • Thrift、Protocol Buffers 和 Avro 等二进制模式驱动格式允许紧凑、高效的编码,并具有明确定义的向前和向后兼容性语义。这些模式对于静态类型语言的文档和代码生成非常有用。然而,它们的缺点是数据需要先解码才能被人类读取。

  • Binary schema–driven formats like Thrift, Protocol Buffers, and Avro allow compact, efficient encoding with clearly defined forward and backward compatibility semantics. The schemas can be useful for documentation and code generation in statically typed languages. However, they have the downside that data needs to be decoded before it is human-readable.
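上面提到文本格式对数据类型的模糊性,可以用一个小例子来体现(假设性示例,非书中代码):JSON 没有二进制字符串类型,常见的变通办法是 Base64;而大整数一旦超出 IEEE 754 双精度浮点数的 53 位精度,在把 JSON 数字解析为浮点数的语言(如 JavaScript)中就会被悄悄舍入。

The vagueness about datatypes mentioned above can be made concrete with a small (hypothetical) example: JSON has no binary string type, so Base64 is a common workaround; and integers beyond the 53-bit precision of an IEEE 754 double are silently rounded in languages that parse JSON numbers as floats (such as JavaScript).

```python
import base64
import json

# Binary strings cannot be placed in JSON directly; Base64-encode them first.
blob = b"\x00\xff some binary bytes"
doc = json.dumps({"payload": base64.b64encode(blob).decode("ascii")})
assert base64.b64decode(json.loads(doc)["payload"]) == blob

# Large integers: anything above 2**53 cannot be represented exactly as an
# IEEE 754 double, which is how JavaScript parses all JSON numbers.
big_id = 10765432100123456789  # e.g., a 64-bit ID
assert big_id > 2**53
assert float(big_id) != big_id  # the double rounds to a nearby value
```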

我们还讨论了数据流的几种模式,说明了数据编码很重要的不同场景:

We also discussed several modes of dataflow, illustrating different scenarios in which data encodings are important:

  • 数据库,写入数据库的进程对数据进行编码,从数据库读取的进程对其进行解码

  • Databases, where the process writing to the database encodes the data and the process reading from the database decodes it

  • RPC 和 REST API,其中客户端对请求进行编码,服务器对请求进行解码并对响应进行编码,最后客户端对响应进行解码

  • RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response

  • 异步消息传递(使用消息代理或参与者),其中节点通过相互发送由发送者编码并由接收者解码的消息进行通信

  • Asynchronous message passing (using message brokers or actors), where nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient

我们可以得出结论,只要稍微小心,向后/向前兼容性和滚动升级是完全可以实现的。愿您的应用程序发展迅速,部署频繁。

We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are quite achievable. May your application’s evolution be rapid and your deployments be frequent.

脚注

i某些特殊情况除外,例如某些内存映射文件或直接对压缩数据进行操作(如 “列压缩”中所述)。

i With the exception of some special cases, such as certain memory-mapped files or when operating directly on compressed data (as described in “Column Compression”).

ii请注意,编码与加密无关。我们在本书中不讨论加密。

ii Note that encoding has nothing to do with encryption. We don’t discuss encryption in this book.

iii实际上,它有三个——BinaryProtocol、CompactProtocol 和 DenseProtocol——尽管 DenseProtocol 仅受 C++ 实现支持,因此它不算是跨语言 [ 18 ]。除此之外,它还有两种不同的基于 JSON 的编码格式 [ 19 ]。多有趣啊!

iii Actually, it has three—BinaryProtocol, CompactProtocol, and DenseProtocol—although DenseProtocol is only supported by the C++ implementation, so it doesn’t count as cross-language [18]. Besides those, it also has two different JSON-based encoding formats [19]. What fun!

iv准确地说,默认值必须是联合体第一个分支的类型,不过这是 Avro 的一个特定限制,而不是联合类型的普遍特性。

iv To be precise, the default value must be of the type of the first branch of the union, although this is a specific limitation of Avro, not a general feature of union types.

v MySQL 除外:它经常重写整个表,尽管这并非绝对必要,如“文档模型中的架构灵活性”中所述。

v Except for MySQL, which often rewrites an entire table even though it is not strictly necessary, as mentioned in “Schema flexibility in the document model”.

vi即使在每个阵营内部也存在很多争论。例如,HATEOAS(超媒体作为应用程序状态引擎)经常引发讨论[ 35 ]。

vi Even within each camp there are plenty of arguments. For example, HATEOAS (hypermedia as the engine of application state), often provokes discussions [35].

vii尽管缩写词相似,但 SOAP 并不是 SOA 的必需条件。SOAP 是一种特殊的技术,而 SOA 是构建系统的通用方法。

vii Despite the similarity of acronyms, SOAP is not a requirement for SOA. SOAP is a particular technology, whereas SOA is a general approach to building systems.

参考

[ 1 ]“ Java 对象序列化规范”,docs.oracle.com,2010 年。

[1] “Java Object Serialization Specification,” docs.oracle.com, 2010.

[ 2 ]“ Ruby 2.2.0 API 文档”,ruby-doc.org,2014 年 12 月。

[2] “Ruby 2.2.0 API Documentation,” ruby-doc.org, Dec 2014.

[ 3 ]“ Python 3.4.3 标准库参考手册”,docs.python.org,2015 年 2 月。

[3] “The Python 3.4.3 Standard Library Reference Manual,” docs.python.org, February 2015.

[ 4 ]“ EsotericSoftware/kryo ”, github.com,2014 年 10 月。

[4] “EsotericSoftware/kryo,” github.com, October 2014.

[ 5 ]“ CWE-502:不受信任数据的反序列化”,常见弱点枚举,cwe.mitre.org,2014 年 7 月 30 日。

[5] “CWE-502: Deserialization of Untrusted Data,” Common Weakness Enumeration, cwe.mitre.org, July 30, 2014.

[ 6 ] Steve Breen:“ WebLogic、WebSphere、JBoss、Jenkins、OpenNMS 和您的应用程序有什么共同点?此漏洞”,foxglovesecurity.com,2015 年 11 月 6 日。

[6] Steve Breen: “What Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This Vulnerability,” foxglovesecurity.com, November 6, 2015.

[ 7 ] Patrick McKenzie:“ Rails 安全问题对您的初创公司意味着什么”,kalzumeus.com,2013 年 1 月 31 日。

[7] Patrick McKenzie: “What the Rails Security Issue Means for Your Startup,” kalzumeus.com, January 31, 2013.

[ 8 ] Eishay Smith:“ jvm-serializers wiki ”, github.com,2014 年 11 月。

[8] Eishay Smith: “jvm-serializers wiki,” github.com, November 2014.

[ 9 ]“ XML 是 S 表达式的糟糕副本”,c2.com wiki。

[9] “XML Is a Poor Copy of S-Expressions,” c2.com wiki.

[ 10 ] Matt Harris:“ Snowflake:更新和一些非常重要的信息”,发送至Twitter Development Talk邮件列表的电子邮件,2010 年 10 月 19 日。

[10] Matt Harris: “Snowflake: An Update and Some Very Important Information,” email to Twitter Development Talk mailing list, October 19, 2010.

[ 11 ] Shudi(Sandy)Gao、CM Sperberg-McQueen 和 Henry S. Thompson:“ XML Schema 1.1 ”,W3C 建议,2001 年 5 月。

[11] Shudi (Sandy) Gao, C. M. Sperberg-McQueen, and Henry S. Thompson: “XML Schema 1.1,” W3C Recommendation, May 2001.

[ 12 ] Francis Galiegue、Kris Zyp 和 Gary Court:“ JSON 架构”,IETF 互联网草案,2013 年 2 月。

[12] Francis Galiegue, Kris Zyp, and Gary Court: “JSON Schema,” IETF Internet-Draft, February 2013.

[ 13 ] Yakov Shafranovich:“ RFC 4180:逗号分隔值 (CSV) 文件的通用格式和 MIME 类型”,2005 年 10 月。

[13] Yakov Shafranovich: “RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files,” October 2005.

[ 14 ]“ MessagePack 规范”,msgpack.org

[14] “MessagePack Specification,” msgpack.org.

[ 15 ]Mark Slee、Aditya Agarwal 和 Marc Kwiatkowski:“ Thrift:可扩展的跨语言服务实施”,Facebook 技术报告,2007 年 4 月。

[15] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski: “Thrift: Scalable Cross-Language Services Implementation,” Facebook technical report, April 2007.

[ 16 ]“ Protocol Buffers 开发人员指南”,Google, Inc.,developers.google.com

[16] “Protocol Buffers Developer Guide,” Google, Inc., developers.google.com.

[ 17 ] Igor Anishchenko:“ Thrift vs Protocol Buffers vs Avro - 有偏差的比较”,slideshare.net,2012 年 9 月 17 日。

[17] Igor Anishchenko: “Thrift vs Protocol Buffers vs Avro - Biased Comparison,” slideshare.net, September 17, 2012.

[ 18 ]“每个语言库支持的功能矩阵”, wiki.apache.org

[18] “A Matrix of the Features Each Individual Language Library Supports,” wiki.apache.org.

[ 19 ] Martin Kleppmann:“ Avro、Protocol Buffers 和 Thrift 中的架构演化”,martin.kleppmann.com,2012 年 12 月 5 日。

[19] Martin Kleppmann: “Schema Evolution in Avro, Protocol Buffers and Thrift,” martin.kleppmann.com, December 5, 2012.

[ 20 ]“ Apache Avro 1.7.7 文档”,avro.apache.org,2014 年 7 月。

[20] “Apache Avro 1.7.7 Documentation,” avro.apache.org, July 2014.

[ 21 ] Doug Cutting、Chad Walters、Jim Kellerman 等人:“ [提案] 新子项目:Avro ”, hadoop-general邮件列表 上的电子邮件主题, mail-archives.apache.org,2009 年 4 月。

[21] Doug Cutting, Chad Walters, Jim Kellerman, et al.: “[PROPOSAL] New Subproject: Avro,” email thread on hadoop-general mailing list, mail-archives.apache.org, April 2009.

[ 22 ] Tony Hoare:“空引用:价值数十亿美元的错误”,伦敦 QCon,2009 年 3 月。

[22] Tony Hoare: “Null References: The Billion Dollar Mistake,” at QCon London, March 2009.

[ 23 ] Aditya Auradkar 和 Tom Quiggle:“ Espresso 简介——LinkedIn 的热门新分布式文档存储”,engineering.linkedin.com,2015 年 1 月 21 日。

[23] Aditya Auradkar and Tom Quiggle: “Introducing Espresso—LinkedIn’s Hot New Distributed Document Store,” engineering.linkedin.com, January 21, 2015.

[ 24 ] Jay Kreps:“使用 Apache Kafka:构建流数据平台实用指南(第 2 部分) ”,blog.confluence.io,2015 年 2 月 25 日。

[24] Jay Kreps: “Putting Apache Kafka to Use: A Practical Guide to Building a Stream Data Platform (Part 2),” blog.confluent.io, February 25, 2015.

[ 25 ] Gwen Shapira:“管理模式的问题”,radar.oreilly.com,2014 年 11 月 4 日。

[25] Gwen Shapira: “The Problem of Managing Schemas,” radar.oreilly.com, November 4, 2014.

[ 26 ]“ Apache Pig 0.14.0 文档”,pig.apache.org,2014 年 11 月。

[26] “Apache Pig 0.14.0 Documentation,” pig.apache.org, November 2014.

[ 27 ] John Larmouth:ASN.1 Complete。Morgan Kaufmann,1999。ISBN:978-0-122-33435-1

[27] John Larmouth: ASN.1 Complete. Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1

[ 28 ] Russell Housley、Warwick Ford、Tim Polk 和 David Solo:“ RFC 2459:互联网 X.509 公钥基础设施:证书和 CRL 配置文件”,IETF 网络工作组,标准跟踪,1999 年 1 月。

[28] Russell Housley, Warwick Ford, Tim Polk, and David Solo: “RFC 2459: Internet X.509 Public Key Infrastructure: Certificate and CRL Profile,” IETF Network Working Group, Standards Track, January 1999.

[ 29 ] Lev Walkin:“问题:可扩展性和删除字段”,lionet.info,2010 年 9 月 21 日。

[29] Lev Walkin: “Question: Extensibility and Dropping Fields,” lionet.info, September 21, 2010.

[ 30 ] Jesse James Garrett:“ Ajax:Web 应用程序的新方法”,adaptivepath.com,2005 年 2 月 18 日。

[30] Jesse James Garrett: “Ajax: A New Approach to Web Applications,” adaptivepath.com, February 18, 2005.

[ 31 ] Sam Newman:构建微服务。奥莱利媒体,2015。ISBN:978-1-491-95035-7

[31] Sam Newman: Building Microservices. O’Reilly Media, 2015. ISBN: 978-1-491-95035-7

[ 32 ] Chris Richardson:“微服务:分解应用程序以实现可部署性和可扩展性”,infoq.com,2014 年 5 月 25 日。

[32] Chris Richardson: “Microservices: Decomposing Applications for Deployability and Scalability,” infoq.com, May 25, 2014.

[ 33 ] Pat Helland:“外部数据与内部数据”,第二届创新数据系统研究双年度会议(CIDR),2005 年 1 月。

[33] Pat Helland: “Data on the Outside Versus Data on the Inside,” at 2nd Biennial Conference on Innovative Data Systems Research (CIDR), January 2005.

[ 34 ] Roy Thomas Fielding:“架构风格和基于网络的软件架构的设计”,博士论文,加州大学欧文分校,2000 年。

[34] Roy Thomas Fielding: “Architectural Styles and the Design of Network-Based Software Architectures,” PhD Thesis, University of California, Irvine, 2000.

[ 35 ] Roy Thomas Fielding:“ REST API 必须是超文本驱动的”,roy.gbiv.com,2008 年 10 月 20 日。

[35] Roy Thomas Fielding: “REST APIs Must Be Hypertext-Driven,” roy.gbiv.com, October 20 2008.

[ 36 ] “安息吧,SOAP ”,royal.pingdom.com,2010 年 10 月 15 日。

[36] “REST in Peace, SOAP,” royal.pingdom.com, October 15, 2010.

[ 37 ]“ 2007 年第一季度的 Web 服务标准”,innoq.com,2007 年 2 月。

[37] “Web Services Standards as of Q1 2007,” innoq.com, February 2007.

[ 38 ] Pete Lacey:“ S 代表简单”,harmful.cat-v.org,2006 年 11 月 15 日。

[38] Pete Lacey: “The S Stands for Simple,” harmful.cat-v.org, November 15, 2006.

[ 39 ] Stefan Tilkov:“采访:Pete Lacey 批评 Web 服务”,infoq.com,2006 年 12 月 12 日。

[39] Stefan Tilkov: “Interview: Pete Lacey Criticizes Web Services,” infoq.com, December 12, 2006.

[ 40 ]“ OpenAPI 规范(前身为 Swagger RESTful API 文档规范)2.0 版”, swagger.io,2014 年 9 月 8 日。

[40] “OpenAPI Specification (fka Swagger RESTful API Documentation Specification) Version 2.0,” swagger.io, September 8, 2014.

[ 41 ] Michi Henning:“ CORBA 的兴衰”, ACM Queue,第 4 卷,第 5 期,第 28-34 页,2006 年 6 月 。doi:10.1145/1142031.1142044

[41] Michi Henning: “The Rise and Fall of CORBA,” ACM Queue, volume 4, number 5, pages 28–34, June 2006. doi:10.1145/1142031.1142044

[ 42 ] Andrew D. Birrell 和 Bruce Jay Nelson:“实现远程过程调用”,ACM Transactions on Computer Systems (TOCS),第 2 卷,第 1 期,第 39-59 页,1984 年 2 月 。doi:10.1145/2080.357392

[42] Andrew D. Birrell and Bruce Jay Nelson: “Implementing Remote Procedure Calls,” ACM Transactions on Computer Systems (TOCS), volume 2, number 1, pages 39–59, February 1984. doi:10.1145/2080.357392

[ 43 ] Jim Waldo、Geoff Wyant、Ann Wollrath 和 Sam Kendall:“关于分布式计算的说明”,Sun Microsystems Laboratories, Inc.,技术报告 TR-94-29,1994 年 11 月。

[43] Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall: “A Note on Distributed Computing,” Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994.

[ 44 ] Steve Vinoski:“便利性优于正确性”,IEEE 互联网计算,第 12 卷,第 4 期,第 89-92 页,2008 年 7 月 。doi:10.1109/MIC.2008.75

[44] Steve Vinoski: “Convenience over Correctness,” IEEE Internet Computing, volume 12, number 4, pages 89–92, July 2008. doi:10.1109/MIC.2008.75

[ 45 ] Marius Eriksen:“ Your Server as a Function ”, 第 7 届编程语言和操作系统研讨会(PLOS),2013 年 11 月 。doi:10.1145/2525528.2525538

[45] Marius Eriksen: “Your Server as a Function,” at 7th Workshop on Programming Languages and Operating Systems (PLOS), November 2013. doi:10.1145/2525528.2525538

[ 46 ]“ grpc-common 文档”,Google, Inc.,github.com,2015 年 2 月。

[46] “grpc-common Documentation,” Google, Inc., github.com, February 2015.

[ 47 ] Aditya Narayan 和 Irina Singh:“设计和版本控制兼容的 Web 服务”,ibm.com,2007 年 3 月 28 日。

[47] Aditya Narayan and Irina Singh: “Designing and Versioning Compatible Web Services,” ibm.com, March 28, 2007.

[ 48 ] Troy Hunt:“您的 API 版本控制是错误的,这就是为什么我决定采用 3 种不同的错误方法” , troyhunt.com,2014 年 2 月 10 日。

[48] Troy Hunt: “Your API Versioning Is Wrong, Which Is Why I Decided to Do It 3 Different Wrong Ways,” troyhunt.com, February 10, 2014.

[ 49 ]“ API 升级”,Stripe, Inc.,2015 年 4 月。

[49] “API Upgrades,” Stripe, Inc., April 2015.

[ 50 ] Jonas Bonér:“在 Akka 集群中升级”,发送至akka-user邮件列表的电子邮件,grokbase.com,2013 年 8 月 28 日。

[50] Jonas Bonér: “Upgrade in an Akka Cluster,” email to akka-user mailing list, grokbase.com, August 28, 2013.

[ 51 ] Philip A. Bernstein、Sergey Bykov、Alan Geller 等人:“奥尔良:用于可编程性和可扩展性的分布式虚拟参与者”,微软研究技术报告 MSR-TR-2014-41,2014 年 3 月。

[51] Philip A. Bernstein, Sergey Bykov, Alan Geller, et al.: “Orleans: Distributed Virtual Actors for Programmability and Scalability,” Microsoft Research Technical Report MSR-TR-2014-41, March 2014.

[ 52 ]“微软奥尔良项目文档”,微软研究院,dotnet.github.io,2015。

[52] “Microsoft Project Orleans Documentation,” Microsoft Research, dotnet.github.io, 2015.

[ 53 ] David Mercer、Sean Hinde、Yinso Chen 和 Richard A O'Keefe:“初学者:更新数据结构”, erlang-questions邮件列表上的电子邮件主题, erlang.com,2007 年 10 月 29 日。

[53] David Mercer, Sean Hinde, Yinso Chen, and Richard A O’Keefe: “beginner: Updating Data Structures,” email thread on erlang-questions mailing list, erlang.com, October 29, 2007.

[ 54 ] Fred Hebert:“后记:地图”,learnyousomeerlang.com,2014 年 4 月 9 日。

[54] Fred Hebert: “Postscript: Maps,” learnyousomeerlang.com, April 9, 2014.

第二部分。分布式数据

Part II. Distributed Data

对于一项成功的技术来说,现实必须优先于公共关系,因为自然是无法被愚弄的。

理查德·费曼,罗杰斯委员会报告(1986)

For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.

Richard Feynman, Rogers Commission Report (1986)

在本书的第一部分中,我们讨论了数据存储在单台机器上时应用的数据系统的各个方面。现在,在 第二部分中,我们上升一个层次并问:如果多台机器参与数据存储和检索,会发生什么?

In Part I of this book, we discussed aspects of data systems that apply when data is stored on a single machine. Now, in Part II, we move up a level and ask: what happens if multiple machines are involved in storage and retrieval of data?

您可能希望将数据库分布在多台计算机上的原因有多种:

There are various reasons why you might want to distribute a database across multiple machines:

可扩展性
Scalability

如果您的数据量、读取负载或写入负载增长到超出单台计算机的处理能力,您可能会将负载分散到多台计算机上。

If your data volume, read load, or write load grows bigger than a single machine can handle, you can potentially spread the load across multiple machines.

容错/高可用性
Fault tolerance/high availability

如果您的应用程序需要在一台机器(或多台机器、网络或整个数据中心)出现故障时继续工作,您可以使用多台机器来提供冗余。当一个人失败时,另一个人可以接替。

If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy. When one fails, another one can take over.

延迟
Latency

如果您的用户遍布世界各地,您可能希望在全球不同地点拥有服务器,以便可以从地理位置靠近他们的数据中心为每个用户提供服务。这避免了用户必须等待网络数据包穿越半个地球。

If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them. That avoids the users having to wait for network packets to travel halfway around the world.

扩展到更高的负载

Scaling to Higher Load

如果您需要的只是扩展到更高的负载,最简单的方法是购买更强大的机器(有时称为垂直扩展或纵向扩展)。许多 CPU、许多 RAM 芯片和许多磁盘可以在一个操作系统下连接在一起,快速互连允许任何 CPU 访问内存或磁盘的任何部分。在这种共享内存架构中,所有组件都可以被视为一台机器 [ 1 ]。i

If all you need is to scale to higher load, the simplest approach is to buy a more powerful machine (sometimes called vertical scaling or scaling up). Many CPUs, many RAM chips, and many disks can be joined together under one operating system, and a fast interconnect allows any CPU to access any part of the memory or disk. In this kind of shared-memory architecture, all the components can be treated as a single machine [1].i

共享内存方法的问题在于成本增长快于线性:一台 CPU 数量、RAM 容量和磁盘容量都翻倍的机器,其价格通常远不止翻倍。而且由于瓶颈的存在,规模翻倍的机器也不一定能处理两倍的负载。

The problem with a shared-memory approach is that the cost grows faster than linearly: a machine with twice as many CPUs, twice as much RAM, and twice as much disk capacity as another typically costs significantly more than twice as much. And due to bottlenecks, a machine twice the size cannot necessarily handle twice the load.

共享内存架构可能提供有限的容错能力——高端机器具有热插拔组件(您可以在不关闭机器的情况下更换磁盘、内存模块甚至CPU)——但它绝对仅限于单个地理位置。

A shared-memory architecture may offer limited fault tolerance—high-end machines have hot-swappable components (you can replace disks, memory modules, and even CPUs without shutting down the machines)—but it is definitely limited to a single geographic location.

另一种方法是共享磁盘架构,它使用多台具有独立 CPU 和 RAM 的机器,但将数据存储在通过快速网络连接的机器之间共享的磁盘阵列上。ii该架构用于某些数据仓库工作负载,但争用和锁定开销限制了共享磁盘方法的可扩展性[ 2 ]。

Another approach is the shared-disk architecture, which uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines, which are connected via a fast network.ii This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach [2].

无共享架构

Shared-Nothing Architectures

相比之下,无共享架构 [ 3 ](有时称为水平扩展或横向扩展)已经广受欢迎。在这种方法中,运行数据库软件的每台机器或虚拟机称为节点。每个节点独立使用自己的 CPU、RAM 和磁盘。节点之间的任何协调都是在软件层面、使用传统网络完成的。

By contrast, shared-nothing architectures [3] (sometimes called horizontal scaling or scaling out) have gained a lot of popularity. In this approach, each machine or virtual machine running the database software is called a node. Each node uses its CPUs, RAM, and disks independently. Any coordination between nodes is done at the software level, using a conventional network.

无共享系统不需要特殊的硬件,因此您可以使用任何具有最佳性价比的机器。您可以将数据分布到多个地理区域,从而减少用户的延迟,并有可能在整个数据中心丢失的情况下幸存下来。通过虚拟机的云部署,您不需要以 Google 规模进行操作:即使对于小公司,多区域分布式架构现在也是可行的。

No special hardware is required by a shared-nothing system, so you can use whatever machines have the best price/performance ratio. You can potentially distribute data across multiple geographic regions, and thus reduce latency for users and potentially be able to survive the loss of an entire datacenter. With cloud deployments of virtual machines, you don’t need to be operating at Google scale: even for small companies, a multi-region distributed architecture is now feasible.

在本书的这一部分中,我们重点关注无共享架构,并不是因为它们一定是每个用例的最佳选择,而是因为它们需要您(应用程序开发人员)最为谨慎。如果您的数据分布在多个节点上,您需要了解这种分布式系统中出现的约束和权衡——数据库无法神奇地向您隐藏这些。

In this part of the book, we focus on shared-nothing architectures—not because they are necessarily the best choice for every use case, but rather because they require the most caution from you, the application developer. If your data is distributed across multiple nodes, you need to be aware of the constraints and trade-offs that occur in such a distributed system—the database cannot magically hide these from you.

虽然分布式无共享架构有很多优点,但它通常也会给应用程序带来额外的复杂性,有时还会限制您可以使用的数据模型的表达能力。在某些情况下,简单的单线程程序的性能明显优于具有超过 100 个 CPU 核心的集群 [ 4 ]。另一方面,无共享系统可能非常强大。接下来的几章将详细介绍数据分发时出现的问题。

While a distributed shared-nothing architecture has many advantages, it usually also incurs additional complexity for applications and sometimes limits the expressiveness of the data models you can use. In some cases, a simple single-threaded program can perform significantly better than a cluster with over 100 CPU cores [4]. On the other hand, shared-nothing systems can be very powerful. The next few chapters go into details on the issues that arise when data is distributed.

复制与分区

Replication Versus Partitioning

数据跨多个节点分布有两种常见的方式:

There are two common ways data is distributed across multiple nodes:

复制
Replication

在多个不同节点(可能位于不同位置)上保留相同数据的副本。复制提供冗余:如果某些节点不可用,仍然可以从其余节点提供数据。复制还可以帮助提高性能。我们将在第 5 章中讨论复制。

Keeping a copy of the same data on several different nodes, potentially in different locations. Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes. Replication can also help improve performance. We discuss replication in Chapter 5.

分区
Partitioning

将大数据库拆分为称为分区的较小子集,以便可以将不同的分区分配给不同的节点(也称为分片)。我们将在第 6 章中讨论分区。

Splitting a big database into smaller subsets called partitions so that different partitions can be assigned to different nodes (also known as sharding). We discuss partitioning in Chapter 6.

这些是独立的机制,但它们通常齐头并进,如图 II-1所示。

These are separate mechanisms, but they often go hand in hand, as illustrated in Figure II-1.

图 II-1。数据库分为两个分区,每个分区有两个副本。

Figure II-1. A database split into two partitions, with two replicas per partition.
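图 II-1 中的布局可以用几行代码勾勒出来(玩具示例,节点名与哈希方案均为假设):每个键经哈希映射到两个分区之一,每个分区又被分配到四个节点中的两个上。

The layout in Figure II-1 can be sketched in a few lines of code (a toy example; the node names and hashing scheme are assumptions): each key hashes to one of two partitions, and each partition is assigned to two of four nodes.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]
NUM_PARTITIONS = 2
REPLICAS_PER_PARTITION = 2

def partition_of(key: str) -> int:
    # Hash the key to spread load evenly across partitions.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return digest[0] % NUM_PARTITIONS

def replicas_of(partition: int) -> list:
    # Each partition lives on a fixed group of nodes (its replicas).
    start = partition * REPLICAS_PER_PARTITION
    return NODES[start:start + REPLICAS_PER_PARTITION]

for key in ["alice", "bob"]:
    p = partition_of(key)
    print(f"{key!r} -> partition {p} on {replicas_of(p)}")
```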

了解这些概念后,我们可以讨论您需要在分布式系统中做出的困难权衡。我们将在第 7 章中讨论事务,因为这将帮助您了解数据系统中可能出现的所有问题,以及您可以采取哪些措施。我们将通过在第 8 章和第 9 章中讨论分布式系统的基本限制来结束本书的这一部分。

With an understanding of those concepts, we can discuss the difficult trade-offs that you need to make in a distributed system. We’ll discuss transactions in Chapter 7, as that will help you understand all the many things that can go wrong in a data system, and what you can do about them. We’ll conclude this part of the book by discussing the fundamental limitations of distributed systems in Chapters 8 and 9.

稍后,在本书的第三部分中,我们将讨论如何采用多个(可能是分布式的)数据存储并将它们集成到更大的系统中,以满足复杂应用程序的需求。但首先,我们来谈谈分布式数据。

Later, in Part III of this book, we will discuss how you can take several (potentially distributed) datastores and integrate them into a larger system, satisfying the needs of a complex application. But first, let’s talk about distributed data.

脚注

i在大型机器中,尽管任何 CPU 都可以访问内存的任何部分,但某些内存组比其他内存组更靠近一个 CPU(这称为非均匀内存访问,或 NUMA [ 1 ])。为了有效利用这种架构,需要分解处理,以便每个 CPU 主要访问附近的内存,这意味着即使表面上运行在一台机器上,仍然需要分区。

i In a large machine, although any CPU can access any part of memory, some banks of memory are closer to one CPU than to others (this is called nonuniform memory access, or NUMA [1]). To make efficient use of this architecture, the processing needs to be broken down so that each CPU mostly accesses memory that is nearby—which means that partitioning is still required, even when ostensibly running on one machine.

ii 网络附加存储(NAS) 或存储区域网络(SAN)。

ii Network Attached Storage (NAS) or Storage Area Network (SAN).

参考

[ 1 ] Ulrich Drepper:“每个程序员都应该了解内存”, akkadia.org,2007 年 11 月 21 日。

[1] Ulrich Drepper: “What Every Programmer Should Know About Memory,” akkadia.org, November 21, 2007.

[ 2 ] Ben Stopford:“无共享与共享磁盘架构:独立观点”,benstopford.com,2009 年 11 月 24 日。

[2] Ben Stopford: “Shared Nothing vs. Shared Disk Architectures: An Independent View,” benstopford.com, November 24, 2009.

[ 3 ] Michael Stonebraker:“ The Case for Shared Nothing ”, IEEE 数据库工程公告,第 9 卷,第 1 期,第 4-9 页,1986 年 3 月。

[3] Michael Stonebraker: “The Case for Shared Nothing,” IEEE Database Engineering Bulletin, volume 9, number 1, pages 4–9, March 1986.

[ 4 ] Frank McSherry、Michael Isard 和 Derek G. Murray:“可扩展性!但代价是什么?”,第 15 届 USENIX 操作系统热门话题研讨会(HotOS),2015 年 5 月。

[4] Frank McSherry, Michael Isard, and Derek G. Murray: “Scalability! But at What COST?,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

第 5 章复制

Chapter 5. Replication

可能出错的事物和不可能出错的事物之间的主要区别在于,当不可能出错的事物真的出错时,通常会发现它根本无法触及,也无法修复。

道格拉斯·亚当斯,《基本无害》(1992)

The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.

Douglas Adams, Mostly Harmless (1992)

复制意味着在通过网络连接的多台计算机上保留相同数据的副本。正如第 II 部分的简介中所讨论的,您可能想要复制数据的原因有多种:

Replication means keeping a copy of the same data on multiple machines that are connected via a network. As discussed in the introduction to Part II, there are several reasons why you might want to replicate data:

  • 让数据在地理位置上靠近用户(从而减少延迟)

  • To keep data geographically close to your users (and thus reduce latency)

  • 即使系统的某些部分发生故障,也允许系统继续工作(从而提高可用性)

  • To allow the system to continue working even if some of its parts have failed (and thus increase availability)

  • 扩展可以服务读取查询的机器数量(从而提高读取吞吐量)

  • To scale out the number of machines that can serve read queries (and thus increase read throughput)

在本章中,我们将假设您的数据集非常小,每台机器都可以保存整个数据集的副本。在第 6 章中,我们将放宽这一假设,并讨论对于单台机器来说太大的数据集的分区(分片)。在后面的章节中,我们将讨论复制数据系统中可能发生的各种故障,以及如何处理它们。

In this chapter we will assume that your dataset is so small that each machine can hold a copy of the entire dataset. In Chapter 6 we will relax that assumption and discuss partitioning (sharding) of datasets that are too big for a single machine. In later chapters we will discuss various kinds of faults that can occur in a replicated data system, and how to deal with them.

如果您要复制的数据不随时间变化,那么复制很容易:您只需将数据复制到每个节点一次即可。复制的所有困难都在于处理被复制数据的变更,这正是本章的主题。我们将讨论三种流行的在节点之间复制变更的算法:单领导者复制、多领导者复制和无领导者复制。几乎所有分布式数据库都使用这三种方法之一。它们各有利弊,我们将详细研究。

If the data that you’re replicating does not change over time, then replication is easy: you just need to copy the data to every node once, and you’re done. All of the difficulty in replication lies in handling changes to replicated data, and that’s what this chapter is about. We will discuss three popular algorithms for replicating changes between nodes: single-leader, multi-leader, and leaderless replication. Almost all distributed databases use one of these three approaches. They all have various pros and cons, which we will examine in detail.

复制需要考虑许多权衡:例如,是使用同步复制还是异步复制,以及如何处理失败的副本。这些通常是数据库中的配置选项,尽管细节因数据库而异,但许多不同实现的一般原则是相似的。我们将在本章中讨论这种选择的后果。

There are many trade-offs to consider with replication: for example, whether to use synchronous or asynchronous replication, and how to handle failed replicas. Those are often configuration options in databases, and although the details vary by database, the general principles are similar across many different implementations. We will discuss the consequences of such choices in this chapter.

数据库复制是一个古老的话题——自 20 世纪 70 年代被研究以来,其原理并没有太大变化 [ 1 ],因为网络的基本约束保持不变。然而,在研究之外,许多开发人员长期以来仍然假设数据库仅由一个节点组成。分布式数据库的主流应用是最近才出现的。由于许多应用程序开发人员是该领域的新手,因此围绕最终一致性等问题存在很多误解。在“复制延迟问题”中,我们将更准确地界定最终一致性,并讨论读己之写和单调读等保证。

Replication of databases is an old topic—the principles haven’t changed much since they were studied in the 1970s [1], because the fundamental constraints of networks have remained the same. However, outside of research, many developers continued to assume for a long time that a database consisted of just one node. Mainstream use of distributed databases is more recent. Since many application developers are new to this area, there has been a lot of misunderstanding around issues such as eventual consistency. In “Problems with Replication Lag” we will get more precise about eventual consistency and discuss things like the read-your-writes and monotonic reads guarantees.

领导者和追随者

Leaders and Followers

每个存储数据库副本的节点称为副本。对于多个副本,不可避免地会出现一个问题:如何确保所有数据最终都在所有副本上?

Each node that stores a copy of the database is called a replica. With multiple replicas, a question inevitably arises: how do we ensure that all the data ends up on all the replicas?

对数据库的每次写入都需要由每个副本处理;否则,副本将不再包含相同的数据。最常见的解决方案称为基于领导者的复制(也称为主动/被动复制或主从复制),如图 5-1 所示。其工作原理如下:

Every write to the database needs to be processed by every replica; otherwise, the replicas would no longer contain the same data. The most common solution for this is called leader-based replication (also known as active/passive or master–slave replication) and is illustrated in Figure 5-1. It works as follows:

  1. 其中一个副本被指定为领导者(也称为主副本或主节点)。当客户端想要写入数据库时,必须将请求发送给领导者,领导者首先将新数据写入其本地存储。

  1. One of the replicas is designated the leader (also known as master or primary). When clients want to write to the database, they must send their requests to the leader, which first writes the new data to its local storage.

  2. 其他副本称为追随者(也称为只读副本、从属副本、辅助副本或热备用副本)。i每当领导者将新数据写入其本地存储时,它还会将数据更改作为复制日志或更改流的一部分发送给其所有追随者。每个追随者从领导者那里获取日志,并按照领导者处理写入的相同顺序应用所有写入,从而相应地更新其本地的数据库副本。

  2. The other replicas are known as followers (read replicas, slaves, secondaries, or hot standbys).i Whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of a replication log or change stream. Each follower takes the log from the leader and updates its local copy of the database accordingly, by applying all writes in the same order as they were processed on the leader.

  3. 当客户端想要从数据库中读取数据时,它可以查询领导者或任何追随者。但是,只有领导者接受写入(从客户端的角度来看,追随者是只读的)。

  3. When a client wants to read from the database, it can query either the leader or any of the followers. However, writes are only accepted on the leader (the followers are read-only from the client’s point of view).

图 5-1。基于领导者(主从)的复制。

Figure 5-1. Leader-based (master–slave) replication.

这种复制模式是许多关系数据库的内置功能,例如 PostgreSQL(自版本 9.0 起)、MySQL、Oracle Data Guard [ 2 ] 和 SQL Server 的 AlwaysOn 可用性组 [ 3 ]。它也用于一些非关系数据库,包括 MongoDB、RethinkDB 和 Espresso [ 4 ]。最后,基于领导者的复制不仅限于数据库:分布式消息代理,例如 Kafka [ 5 ] 和 RabbitMQ 高可用队列 [ 6 ] 也使用它。一些网络文件系统和复制块设备(例如 DRBD)是相似的。

This mode of replication is a built-in feature of many relational databases, such as PostgreSQL (since version 9.0), MySQL, Oracle Data Guard [2], and SQL Server’s AlwaysOn Availability Groups [3]. It is also used in some nonrelational databases, including MongoDB, RethinkDB, and Espresso [4]. Finally, leader-based replication is not restricted to only databases: distributed message brokers such as Kafka [5] and RabbitMQ highly available queues [6] also use it. Some network filesystems and replicated block devices such as DRBD are similar.
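上述三个步骤可以用一个内存中的玩具草图来表示(假设性示例,类名与接口均为虚构,不代表任何真实数据库的实现):写入只发给领导者,领导者将变更追加到复制日志并按顺序转发给所有追随者;读取可以由任一副本处理。

The three steps above can be sketched as an in-memory toy model (a hypothetical example; the class names and interfaces are made up and do not reflect any real database): writes go only to the leader, which appends each change to a replication log and forwards it to every follower in order; reads can be served by any replica.

```python
class Replica:
    """A node holding a copy of the database (here, just a dict)."""
    def __init__(self):
        self.data = {}

    def apply(self, entry):
        key, value = entry
        self.data[key] = value

class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers
        self.log = []  # the replication log / change stream

    def write(self, key, value):
        entry = (key, value)
        self.apply(entry)          # 1. write to the leader's local storage
        self.log.append(entry)     # 2. record the change in the log...
        for follower in self.followers:
            follower.apply(entry)  # ...and ship it to every follower in order

followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("profile_image", "cat.png")

# 3. Reads can go to the leader or to any follower.
print(followers[0].data["profile_image"])
```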

同步复制与异步复制

Synchronous Versus Asynchronous Replication

复制系统的一个重要细节是复制是同步发生还是 异步发生。(在关系数据库中,这通常是一个可配置选项;其他系统通常被硬编码为其中之一。)

An important detail of a replicated system is whether the replication happens synchronously or asynchronously. (In relational databases, this is often a configurable option; other systems are often hardcoded to be either one or the other.)

想想图 5-1中发生的情况,其中网站用户更新了他们的个人资料图片。在某个时间点,客户端向领导者发送更新请求;不久之后,就被领导收到了。在某个时刻,领导者将数据更改转发给追随者。最终,leader通知client更新成功。

Think about what happens in Figure 5-1, where the user of a website updates their profile image. At some point in time, the client sends the update request to the leader; shortly afterward, it is received by the leader. At some point, the leader forwards the data change to the followers. Eventually, the leader notifies the client that the update was successful.

图 5-2显示了系统各个组件之间的通信:用户的客户端、领导者和两个追随者。时间从左向右流动。请求或响应消息显示为粗箭头。

Figure 5-2 shows the communication between various components of the system: the user’s client, the leader, and two followers. Time flows from left to right. A request or response message is shown as a thick arrow.

图 5-2。基于领导者的复制,具有一个同步追随者和一个异步追随者。

Figure 5-2. Leader-based replication with one synchronous and one asynchronous follower.

在图 5-2 的示例中,到跟随者 1 的复制是 同步的:领导者会等待,直到跟随者 1 确认收到写入,然后再向用户报告成功,并使写入对其他客户端可见。到跟随者 2 的复制是异步的:领导者发送消息,但不等待跟随者的响应。

In the example of Figure 5-2, the replication to follower 1 is synchronous: the leader waits until follower 1 has confirmed that it received the write before reporting success to the user, and before making the write visible to other clients. The replication to follower 2 is asynchronous: the leader sends the message, but doesn’t wait for a response from the follower.

该图显示,追随者 2 处理消息之前存在相当大的延迟。通常,复制速度非常快:大多数数据库系统在不到一秒的时间内将更改应用到追随者。但是,无法保证可能需要多长时间。在某些情况下,追随者可能会落后领导者几分钟或更长时间;例如,如果跟随者正在从故障中恢复,如果系统正在接近最大容量运行,或者节点之间存在网络问题。

The diagram shows that there is a substantial delay before follower 2 processes the message. Normally, replication is quite fast: most database systems apply changes to followers in less than a second. However, there is no guarantee of how long it might take. There are circumstances when followers might fall behind the leader by several minutes or more; for example, if a follower is recovering from a failure, if the system is operating near maximum capacity, or if there are network problems between the nodes.

同步复制的优点是保证追随者拥有与领导者一致的最新数据副本。如果领导者突然发生故障,我们可以确定这些数据在追随者上仍然可用。缺点是,如果同步追随者没有响应(因为它崩溃了,或者出现网络故障,或者任何其他原因),写入就无法被处理。领导者必须阻塞所有写入,并等待同步副本再次可用。

The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader. If the leader suddenly fails, we can be sure that the data is still available on the follower. The disadvantage is that if the synchronous follower doesn’t respond (because it has crashed, or there is a network fault, or for any other reason), the write cannot be processed. The leader must block all writes and wait until the synchronous replica is available again.

因此,所有追随者同步是不切实际的:任何一个节点中断都会导致整个系统陷入瘫痪。实际上,如果在数据库上启用同步复制,通常意味着其中一个追随者是同步的,而其他追随者是异步的。如果同步跟随者变得不可用或缓慢,则异步跟随者之一将变为同步。这可以保证您在至少两个节点上拥有最新的数据副本:领导者和一个同步跟随者。这种配置有时也称为半同步[ 7 ]。

For that reason, it is impractical for all followers to be synchronous: any one node outage would cause the whole system to grind to a halt. In practice, if you enable synchronous replication on a database, it usually means that one of the followers is synchronous, and the others are asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous. This guarantees that you have an up-to-date copy of the data on at least two nodes: the leader and one synchronous follower. This configuration is sometimes also called semi-synchronous [7].
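上述写入路径可以用一个极简的草图来说明(假设性示意,并非任何真实数据库的实现):领导者阻塞等待一个同步追随者的确认,而对异步追随者只是发送消息、不等待。

The write path described above can be sketched minimally (a hypothetical illustration, not any real database's implementation): the leader blocks until one synchronous follower acknowledges, while messages to asynchronous followers are merely sent without waiting for them to be applied.

```python
class SyncFollower:
    def __init__(self):
        self.log = []

    def replicate(self, entry):
        # Apply the change and acknowledge to the leader.
        self.log.append(entry)
        return True

class AsyncFollower:
    def __init__(self):
        self.log = []
        self.pending = []  # simulates in-flight, not-yet-applied messages

    def receive(self, entry):
        self.pending.append(entry)  # the leader does not wait for this

    def catch_up(self):
        self.log.extend(self.pending)
        self.pending.clear()

class Leader:
    def __init__(self, sync_follower, async_followers):
        self.log = []
        self.sync = sync_follower
        self.asyncs = async_followers

    def write(self, entry):
        self.log.append(entry)
        # Synchronous replication: block until the follower confirms,
        # only then report success to the client.
        if not self.sync.replicate(entry):
            raise RuntimeError("write blocked: sync follower unavailable")
        # Asynchronous replication: send and return immediately.
        for f in self.asyncs:
            f.receive(entry)
        return "ok"
```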

通常,基于领导者的复制被配置为完全异步。在这种情况下,如果领导者发生故障并且不可恢复,则任何尚未复制到追随者的写入都会丢失。这意味着即使已向客户端确认写入,也不能保证写入是持久的。然而,完全异步配置的优点是,即使所有追随者都落后了,领导者也可以继续处理写入。

Often, leader-based replication is configured to be completely asynchronous. In this case, if the leader fails and is not recoverable, any writes that have not yet been replicated to followers are lost. This means that a write is not guaranteed to be durable, even if it has been confirmed to the client. However, a fully asynchronous configuration has the advantage that the leader can continue processing writes, even if all of its followers have fallen behind.

削弱持久性听起来像是一个糟糕的权衡,但异步复制仍然被广泛使用,特别是如果有很多追随者或者他们分布在不同的地理位置。我们将在“复制延迟问题”中回到这个问题。

Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless widely used, especially if there are many followers or if they are geographically distributed. We will return to this issue in “Problems with Replication Lag”.

设置新的追随者

Setting Up New Followers

有时,您需要设置新的追随者,也许是为了增加副本数量,或者替换发生故障的节点。如何确保新的追随者拥有领导者数据的准确副本?

From time to time, you need to set up new followers—perhaps to increase the number of replicas, or to replace failed nodes. How do you ensure that the new follower has an accurate copy of the leader’s data?

简单地将数据文件从一个节点复制到另一个节点通常是不够的:客户端不断写入数据库,并且数据始终在变化,因此标准文件副本会在不同时间点看到数据库的不同部分。结果可能没有任何意义。

Simply copying data files from one node to another is typically not sufficient: clients are constantly writing to the database, and the data is always in flux, so a standard file copy would see different parts of the database at different points in time. The result might not make any sense.

您可以通过锁定数据库(使其不可写入)来使磁盘上的文件保持一致,但这将违背我们的高可用性目标。幸运的是,设置追随者通常可以在不停机的情况下完成。从概念上讲,该过程如下所示:

You could make the files on disk consistent by locking the database (making it unavailable for writes), but that would go against our goal of high availability. Fortunately, setting up a follower can usually be done without downtime. Conceptually, the process looks like this:

  1. 在某个时间点拍摄领导者数据库的一致快照 - 如果可能的话,不要锁定整个数据库。大多数数据库都有此功能,因为备份也需要它。在某些情况下,需要第三方工具,例如MySQL 的innobackupex [ 12 ]。

  1. Take a consistent snapshot of the leader’s database at some point in time—if possible, without taking a lock on the entire database. Most databases have this feature, as it is also required for backups. In some cases, third-party tools are needed, such as innobackupex for MySQL [12].

  2. 将快照复制到新的追随者节点。

  2. Copy the snapshot to the new follower node.

  3. 追随者连接到领导者,并请求自快照创建以来发生的所有数据更改。这要求快照与领导者复制日志中的确切位置相关联。该位置有各种名称:例如,PostgreSQL 将其称为日志序列号,MySQL 将其称为二进制日志坐标。

  3. The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. This requires that the snapshot is associated with an exact position in the leader’s replication log. That position has various names: for example, PostgreSQL calls it the log sequence number, and MySQL calls it the binlog coordinates.

  4. 当追随者处理完自快照以来积压的数据变更时,我们说它已经赶上了。现在,它可以在领导者发生数据更改时继续处理这些更改。

  4. When the follower has processed the backlog of data changes since the snapshot, we say it has caught up. It can now continue to process data changes from the leader as they happen.
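上面的几个步骤可以用如下草图表示(假设性示意):关键在于快照要与复制日志中的精确位置关联,追赶阶段只需重放该位置之后的变更。

The steps above can be sketched as follows (a hypothetical illustration); the key point is that the snapshot is tied to an exact position in the replication log, so the catch-up phase only needs to replay changes after that position.

```python
class ReplicationLog:
    def __init__(self):
        self.entries = []

    def append(self, key, value):
        self.entries.append((key, value))
        return len(self.entries)  # a position, akin to PostgreSQL's log sequence number

def take_snapshot(data, log):
    # Step 1: a consistent snapshot, associated with an exact log position.
    return dict(data), len(log.entries)

def bootstrap_follower(snapshot, snapshot_pos, log):
    # Step 2: start from a copy of the snapshot.
    data = dict(snapshot)
    # Step 3: request all changes that happened since the snapshot was taken.
    for key, value in log.entries[snapshot_pos:]:
        data[key] = value
    # Step 4: caught up; from here on, keep consuming new log entries.
    return data
```

Writes that arrive between the snapshot and the catch-up are not lost, because they sit in the log after the recorded position.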

设置追随者的实际步骤因数据库而异。在某些系统中,该过程是完全自动化的;而在其他系统中,它可能是一个有点晦涩的多步骤工作流程,需要管理员手动执行。

The practical steps of setting up a follower vary significantly by database. In some systems the process is fully automated, whereas in others it can be a somewhat arcane multi-step workflow that needs to be manually performed by an administrator.

处理节点中断

Handling Node Outages

系统中的任何节点都可能停机,可能是由于故障而意外停机,但同样可能是由于计划维护(例如,重启机器以安装内核安全补丁)。能够在不停机的情况下重启单个节点,对运维来说是一个很大的优势。因此,我们的目标是在个别节点发生故障时保持系统整体运行,并使节点中断的影响尽可能小。

Any node in the system can go down, perhaps unexpectedly due to a fault, but just as likely due to planned maintenance (for example, rebooting a machine to install a kernel security patch). Being able to reboot individual nodes without downtime is a big advantage for operations and maintenance. Thus, our goal is to keep the system as a whole running despite individual node failures, and to keep the impact of a node outage as small as possible.

如何通过基于领导者的复制实现高可用性?

How do you achieve high availability with leader-based replication?

追随者故障:追赶恢复

Follower failure: Catch-up recovery

每个追随者都在其本地磁盘上保存着从领导者接收到的数据更改日志。如果追随者崩溃并重新启动,或者领导者和追随者之间的网络暂时中断,追随者可以很容易地恢复:从它的日志中,它知道故障发生之前处理的最后一个事务。因此,追随者可以连接到领导者,请求断开连接期间发生的所有数据更改。应用完这些更改后,它就赶上了领导者,并可以像以前一样继续接收数据更改流。

On its local disk, each follower keeps a log of the data changes it has received from the leader. If a follower crashes and is restarted, or if the network between the leader and the follower is temporarily interrupted, the follower can recover quite easily: from its log, it knows the last transaction that was processed before the fault occurred. Thus, the follower can connect to the leader and request all the data changes that occurred during the time when the follower was disconnected. When it has applied these changes, it has caught up to the leader and can continue receiving a stream of data changes as before.

领导失败:故障转移

Leader failure: Failover

处理领导者的失败比较棘手:其中一个追随者需要被提升为新的领导者,客户端需要重新配置,以便将写入发送给新的领导者,其他追随者则需要开始从新的领导者消费数据更改。这个过程称为 故障转移。

Handling a failure of the leader is trickier: one of the followers needs to be promoted to be the new leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader. This process is called failover.

故障转移可以手动发生(管理员收到领导者发生故障的通知,并采取必要的步骤来建立新的领导者)或自动发生。自动故障转移过程通常包含以下步骤:

Failover can happen manually (an administrator is notified that the leader has failed and takes the necessary steps to make a new leader) or automatically. An automatic failover process usually consists of the following steps:

  1. 确定领导者已经失败。有很多事情可能出错:崩溃、断电、网络问题等等。没有万无一失的方法来检测出了什么问题,因此大多数系统只是使用超时:节点之间经常来回传递消息,如果一个节点在一段时间内(例如 30 秒)没有响应,就认为它已经死了。(如果领导者因计划维护而被故意关闭,则此规则不适用。)

  1. Determining that the leader has failed. There are many things that could potentially go wrong: crashes, power outages, network issues, and more. There is no foolproof way of detecting what has gone wrong, so most systems simply use a timeout: nodes frequently bounce messages back and forth between each other, and if a node doesn’t respond for some period of time—say, 30 seconds—it is assumed to be dead. (If the leader is deliberately taken down for planned maintenance, this doesn’t apply.)

  2. 选择新的领导者。这可以通过选举过程来完成(其中新的领导者由大多数剩余副本选出),或者新的领导者可以由先前选出的控制节点指定。领导者的最佳候选者通常是拥有旧领导者最新数据更改的副本(以最大限度地减少任何数据丢失)。让所有节点就新领导者达成一致是一个共识问题,第 9 章将详细讨论。

  2. Choosing a new leader. This could be done through an election process (where the leader is chosen by a majority of the remaining replicas), or a new leader could be appointed by a previously elected controller node. The best candidate for leadership is usually the replica with the most up-to-date data changes from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader is a consensus problem, discussed in detail in Chapter 9.

  3. 重新配置系统以使用新的领导者。客户端现在需要将写入请求发送给新的领导者(我们在“请求路由”中讨论这一点)。如果旧的领导者回来了,它可能仍然相信自己是领导者,而没有意识到其他副本已经迫使它下台。系统需要确保旧的领导者变成追随者,并认可新的领导者。

  3. Reconfiguring the system to use the new leader. Clients now need to send their write requests to the new leader (we discuss this in “Request Routing”). If the old leader comes back, it might still believe that it is the leader, not realizing that the other replicas have forced it to step down. The system needs to ensure that the old leader becomes a follower and recognizes the new leader.
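前两步(基于超时的故障检测,以及选择数据最新的副本)可以用一个极简的草图说明(假设性示意;真实系统在这里需要共识算法,远比这复杂,见第 9 章):

The first two steps (timeout-based failure detection and choosing the most up-to-date replica) can be sketched minimally (a hypothetical illustration; real systems need a consensus algorithm here, which is far more involved, as Chapter 9 discusses):

```python
def leader_is_dead(last_heartbeat, now, timeout=30.0):
    # Step 1: there is no foolproof detection; simply assume the leader
    # is dead if it hasn't responded within the timeout.
    return (now - last_heartbeat) > timeout

def choose_new_leader(replica_positions):
    # Step 2: pick the replica whose replication log position is most
    # up to date, to minimize data loss. (Real systems must also get
    # all nodes to agree on this choice, i.e., solve consensus.)
    return max(replica_positions, key=replica_positions.get)
```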

故障转移充满了可能出错的事情:

Failover is fraught with things that can go wrong:

  • 如果使用异步复制,新的领导者在失败之前可能还没有收到旧领导者的所有写入。如果在选择新领导者后,前领导者重新加入集群,那么这些写入会发生什么?与此同时,新领导者可能收到了相互冲突的写入。最常见的解决方案是简单地丢弃旧领导者的未复制写入,这可能会违反客户的持久性期望。

  • If asynchronous replication is used, the new leader may not have received all the writes from the old leader before it failed. If the former leader rejoins the cluster after a new leader has been chosen, what should happen to those writes? The new leader may have received conflicting writes in the meantime. The most common solution is for the old leader’s unreplicated writes to simply be discarded, which may violate clients’ durability expectations.

  • 如果数据库外部的其他存储系统需要与数据库内容保持协调,丢弃写入就尤其危险。例如,在 GitHub [ 13 ] 的一次事故中,一个数据过时的 MySQL 追随者被提升为领导者。数据库使用自增计数器为新行分配主键,但由于新领导者的计数器落后于旧领导者的计数器,它重用了旧领导者先前分配过的一些主键。这些主键也被用在一个 Redis 存储中,因此主键的重用导致了 MySQL 和 Redis 之间的不一致,使得一些私有数据被泄露给了错误的用户。

  • Discarding writes is especially dangerous if other storage systems outside of the database need to be coordinated with the database contents. For example, in one incident at GitHub [13], an out-of-date MySQL follower was promoted to leader. The database used an autoincrementing counter to assign primary keys to new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some primary keys that were previously assigned by the old leader. These primary keys were also used in a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis, which caused some private data to be disclosed to the wrong users.

  • 在某些故障场景中(参见第8章),可能会发生两个节点都认为自己是领导者的情况。这种情况被称为裂脑,这是很危险的:如果两个领导者都接受写入,并且没有解决冲突的过程(参见 “多领导者复制”),数据很可能会丢失或损坏。作为一项安全措施,某些系统有一种机制,可以在检测到两个领导者时关闭一个节点。ii 但是,如果此机制设计不仔细,最终可能会导致两个节点都被关闭 [ 14 ]。

  • In certain fault scenarios (see Chapter 8), it could happen that two nodes both believe that they are the leader. This situation is called split brain, and it is dangerous: if both leaders accept writes, and there is no process for resolving conflicts (see “Multi-Leader Replication”), data is likely to be lost or corrupted. As a safety catch, some systems have a mechanism to shut down one node if two leaders are detected.ii However, if this mechanism is not carefully designed, you can end up with both nodes being shut down [14].

  • 在宣布领导者死亡之前,正确的超时时间是多少?较长的超时意味着在领导者发生故障的情况下恢复的时间较长。但是,如果超时太短,可能会出现不必要的故障转移。例如,临时负载峰值可能会导致节点的响应时间增加到超时以上,或者网络故障可能会导致数据包延迟。如果系统已经在应对高负载或网络问题,不必要的故障转移可能会使情况变得更糟,而不是更好。

  • What is the right timeout before the leader is declared dead? A longer timeout means a longer time to recovery in the case where the leader fails. However, if the timeout is too short, there could be unnecessary failovers. For example, a temporary load spike could cause a node’s response time to increase above the timeout, or a network glitch could cause delayed packets. If the system is already struggling with high load or network problems, an unnecessary failover is likely to make the situation worse, not better.

这些问题没有简单的解决方案。因此,一些运营团队更喜欢手动执行故障转移,即使软件支持自动故障转移。

There are no easy solutions to these problems. For this reason, some operations teams prefer to perform failovers manually, even if the software supports automatic failover.

这些问题——节点故障;不可靠的网络;围绕副本一致性、持久性、可用性和延迟的权衡实际上是分布式系统中的基本问题。在第 8 章和第 9 章中,我们将更深入地讨论它们。

These issues—node failures; unreliable networks; and trade-offs around replica consistency, durability, availability, and latency—are in fact fundamental problems in distributed systems. In Chapter 8 and Chapter 9 we will discuss them in greater depth.

复制日志的实现

Implementation of Replication Logs

基于领导者的复制在幕后是如何工作的?实践中使用了几种不同的复制方法,因此让我们简要介绍一下每种方法。

How does leader-based replication work under the hood? Several different replication methods are used in practice, so let’s look at each one briefly.

基于语句的复制

Statement-based replication

在最简单的情况下,领导者记录它执行的每个写入请求(语句),并将该语句日志发送给它的追随者。对于关系数据库来说,这意味着每个 INSERT、UPDATE 或 DELETE 语句都会被转发给追随者,每个追随者都会解析并执行该 SQL 语句,就像它是从客户端收到的一样。

In the simplest case, the leader logs every write request (statement) that it executes and sends that statement log to its followers. For a relational database, this means that every INSERT, UPDATE, or DELETE statement is forwarded to followers, and each follower parses and executes that SQL statement as if it had been received from a client.

尽管这听起来很合理,但这种复制方法可能会通过多种方式失效:

Although this may sound reasonable, there are various ways in which this approach to replication can break down:

  • 任何调用非确定性函数的语句(例如NOW()获取当前日期和时间或RAND()获取随机数)都可能在每个副本上生成不同的值。

  • Any statement that calls a nondeterministic function, such as NOW() to get the current date and time or RAND() to get a random number, is likely to generate a different value on each replica.

  • 如果语句使用自增列,或者它们依赖于数据库中的现有数据(例如 UPDATE … WHERE <some condition>),则它们必须在每个副本上以完全相同的顺序执行,否则可能产生不同的效果。当存在多个并发执行的事务时,这可能会成为一种限制。

  • If statements use an autoincrementing column, or if they depend on the existing data in the database (e.g., UPDATE … WHERE <some condition>), they must be executed in exactly the same order on each replica, or else they may have a different effect. This can be limiting when there are multiple concurrently executing transactions.

  • 具有副作用的语句(例如触发器、存储过程、用户定义函数)可能会导致每个副本上出现不同的副作用,除非副作用是绝对确定性的。

  • Statements that have side effects (e.g., triggers, stored procedures, user-defined functions) may result in different side effects occurring on each replica, unless the side effects are absolutely deterministic.

解决这些问题是可能的,例如,领导者可以在记录语句时用固定的返回值替换任何不确定的函数调用,以便追随者都获得相同的值。然而,由于存在如此多的边缘情况,现在通常首选其他复制方法。

It is possible to work around those issues—for example, the leader can replace any nondeterministic function calls with a fixed return value when the statement is logged so that the followers all get the same value. However, because there are so many edge cases, other replication methods are now generally preferred.

MySQL 5.1 版本之前使用基于语句的复制。今天它仍然有时被使用,因为它非常紧凑,但默认情况下,如果语句中存在任何不确定性,MySQL 现在会切换到基于行的复制(稍后讨论)。VoltDB 使用基于语句的复制,并通过要求事务具有确定性来使其安全 [ 15 ]。

Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today, as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if there is any nondeterminism in a statement. VoltDB uses statement-based replication, and makes it safe by requiring transactions to be deterministic [15].
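非确定性语句如何导致副本分叉,以及领导者在记录语句时把调用结果固化成固定值的解决办法,可以用如下假设性示意来演示:

How a nondeterministic statement makes replicas diverge, and the fix of having the leader substitute a fixed return value when logging the statement, can be demonstrated with this hypothetical sketch:

```python
import random

def replay(statement_log, rng):
    # A toy "statement executor": RAND() is re-evaluated on each replica,
    # using that replica's own random state.
    row = {}
    for column, expr in statement_log:
        row[column] = rng.random() if expr == "RAND()" else expr
    return row

# The leader and a follower replay the same statement with different
# random states, so their rows diverge.
leader_row = replay([("token", "RAND()")], random.Random(42))
follower_row = replay([("token", "RAND()")], random.Random(7))

# Fix: when logging the statement, the leader records the actual value
# it computed, so every follower deterministically gets the same result.
fixed_log = [("token", leader_row["token"])]
```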

预写日志 (WAL) 传送

Write-ahead log (WAL) shipping

第3章中,我们讨论了存储引擎如何表示磁盘上的数据,我们发现通常每次写入都会附加到日志中:

In Chapter 3 we discussed how storage engines represent data on disk, and we found that usually every write is appended to a log:

  • 对于日志结构存储引擎(参见“SSTables 和 LSM-Trees”),该日志是主要的存储位置。日志段在后台进行压缩和垃圾收集。

  • In the case of a log-structured storage engine (see “SSTables and LSM-Trees”), this log is the main place for storage. Log segments are compacted and garbage-collected in the background.

  • 对于覆盖单个磁盘块的 B 树(请参阅“B 树”)来说,每次修改都会首先写入预写日志,以便索引可以在崩溃后恢复到一致的状态。

  • In the case of a B-tree (see “B-Trees”), which overwrites individual disk blocks, every modification is first written to a write-ahead log so that the index can be restored to a consistent state after a crash.

无论哪种情况,日志都是仅附加的字节序列,包含对数据库的所有写入。我们可以使用完全相同的日志在另一个节点上构建副本:除了将日志写入磁盘之外,领导者还通过网络将其发送给其追随者。当追随者处理此日志时,它会构建与领导者上完全相同的数据结构的副本。

In either case, the log is an append-only sequence of bytes containing all writes to the database. We can use the exact same log to build a replica on another node: besides writing the log to disk, the leader also sends it across the network to its followers. When the follower processes this log, it builds a copy of the exact same data structures as found on the leader.

这种复制方法用于 PostgreSQL 和 Oracle 等 [ 16 ]。主要缺点是日志在非常低的级别上描述数据:WAL 包含哪些磁盘块中哪些字节被更改的详细信息。这使得复制与存储引擎紧密耦合。如果数据库将其存储格式从一种版本更改为另一种版本,通常不可能在领导者和跟随者上运行不同版本的数据库软件。

This method of replication is used in PostgreSQL and Oracle, among others [16]. The main disadvantage is that the log describes the data on a very low level: a WAL contains details of which bytes were changed in which disk blocks. This makes replication closely coupled to the storage engine. If the database changes its storage format from one version to another, it is typically not possible to run different versions of the database software on the leader and the followers.

这看起来似乎是一个很小的实现细节,但它可能产生很大的运维影响。如果复制协议允许追随者使用比领导者更新的软件版本,就可以通过先升级追随者、再执行故障转移使某个已升级的节点成为新领导者,从而实现数据库软件的零停机升级。如果复制协议不允许这种版本不匹配(WAL 传送经常如此),则此类升级需要停机。

That may seem like a minor implementation detail, but it can have a big operational impact. If the replication protocol allows the follower to use a newer software version than the leader, you can perform a zero-downtime upgrade of the database software by first upgrading the followers and then performing a failover to make one of the upgraded nodes the new leader. If the replication protocol does not allow this version mismatch, as is often the case with WAL shipping, such upgrades require downtime.

逻辑(基于行)日志复制

Logical (row-based) log replication

另一种方法是对复制和存储引擎使用不同的日志格式,这允许复制日志与存储引擎内部分离。这种复制日志称为逻辑日志,以区别于存储引擎的(物理)数据表示。

An alternative is to use different log formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals. This kind of replication log is called a logical log, to distinguish it from the storage engine’s (physical) data representation.

关系数据库的逻辑日志通常是一系列记录,以行的粒度描述对数据库表的写入:

A logical log for a relational database is usually a sequence of records describing writes to database tables at the granularity of a row:

  • 对于插入的行,日志包含所有列的新值。

  • For an inserted row, the log contains the new values of all columns.

  • 对于已删除的行,日志包含足够的信息来唯一标识已删除的行。通常这将是主键,但如果表上没有主键,则需要记录所有列的旧值。

  • For a deleted row, the log contains enough information to uniquely identify the row that was deleted. Typically this would be the primary key, but if there is no primary key on the table, the old values of all columns need to be logged.

  • 对于更新的行,日志包含足够的信息来唯一标识更新的行以及所有列的新值(或至少是所有更改的列的新值)。

  • For an updated row, the log contains enough information to uniquely identify the updated row, and the new values of all columns (or at least the new values of all columns that changed).

修改多行的事务会生成多个此类日志记录,后跟一条指示该事务已提交的记录。MySQL 的 binlog(当配置为使用基于行的复制时)使用这种方法 [ 17 ]。

A transaction that modifies several rows generates several such log records, followed by a record indicating that the transaction was committed. MySQL’s binlog (when configured to use row-based replication) uses this approach [17].
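这样的逻辑日志可以用如下假设性格式来示意(并非 MySQL binlog 的真实编码):每条记录以行为粒度描述一次写入,事务以一条 commit 记录结尾。

Such a logical log can be illustrated with the following hypothetical format (not MySQL's actual binlog encoding): each record describes one write at row granularity, and a transaction ends with a commit record.

```python
logical_log = [
    {"op": "insert", "table": "users",
     "row": {"id": 1234, "name": "Alice"}},          # new values of all columns
    {"op": "update", "table": "users", "key": {"id": 1234},
     "changed": {"name": "Alicia"}},                 # only the columns that changed
    {"op": "commit", "txid": 42},
]

def apply_logical_log(tables, log):
    # Any consumer (a replica, a data warehouse, a cache) can parse and
    # apply these records without knowing the storage engine internals.
    for rec in log:
        if rec["op"] == "insert":
            tables[rec["table"]][rec["row"]["id"]] = dict(rec["row"])
        elif rec["op"] == "update":
            tables[rec["table"]][rec["key"]["id"]].update(rec["changed"])
        elif rec["op"] == "delete":
            # The key (typically the primary key) uniquely identifies the row.
            del tables[rec["table"]][rec["key"]["id"]]
    return tables
```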

由于逻辑日志与存储引擎内部解耦,因此可以更轻松地保持向后兼容,从而允许领导者和跟随者运行不同版本的数据库软件,甚至不同的存储引擎。

Since a logical log is decoupled from the storage engine internals, it can more easily be kept backward compatible, allowing the leader and the follower to run different versions of the database software, or even different storage engines.

逻辑日志格式也更容易被外部应用程序解析。如果您想要将数据库的内容发送到外部系统(例如用于离线分析的数据仓库,或用于构建自定义索引和缓存[18]),此方面非常有用。这种技术称为变更数据捕获,我们将在第 11 章中再次讨论它。

A logical log format is also easier for external applications to parse. This aspect is useful if you want to send the contents of a database to an external system, such as a data warehouse for offline analysis, or for building custom indexes and caches [18]. This technique is called change data capture, and we will return to it in Chapter 11.

基于触发器的复制

Trigger-based replication

到目前为止描述的复制方法都是由数据库系统实现的,不涉及任何应用程序代码。在许多情况下,这正是您想要的;但在某些情况下,需要更大的灵活性。例如,如果您只想复制数据的一个子集,或者想从一种数据库复制到另一种数据库,或者需要冲突解决逻辑(请参阅“处理写入冲突”),那么您可能需要将复制上移到应用程序层。

The replication approaches described so far are implemented by the database system, without involving any application code. In many cases, that’s what you want—but there are some circumstances where more flexibility is needed. For example, if you want to only replicate a subset of the data, or want to replicate from one kind of database to another, or if you need conflict resolution logic (see “Handling Write Conflicts”), then you may need to move replication up to the application layer.

一些工具,例如 Oracle GoldenGate [ 19 ],可以通过读取数据库日志来使数据更改可供应用程序使用。另一种方法是使用许多关系数据库中提供的功能:触发器存储过程

Some tools, such as Oracle GoldenGate [19], can make data changes available to an application by reading the database log. An alternative is to use features that are available in many relational databases: triggers and stored procedures.

触发器允许您注册自定义应用程序代码,当数据库系统中发生数据更改(写入事务)时,该代码会自动执行。触发器有机会将此更改记录到一个单独的表中,外部进程可以从中读取它。然后,该外部进程可以应用任何必要的应用程序逻辑并将数据更改复制到另一个系统。例如,Oracle [ 20 ] 的 Databus 和 Postgres [ 21 ]的 Bucardo就是这样工作的。

A trigger lets you register custom application code that is automatically executed when a data change (write transaction) occurs in a database system. The trigger has the opportunity to log this change into a separate table, from which it can be read by an external process. That external process can then apply any necessary application logic and replicate the data change to another system. Databus for Oracle [20] and Bucardo for Postgres [21] work like this, for example.
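这种模式可以用一个假设性的小示意来说明:触发器把每次变更记录到一张单独的变更表中,一个外部进程读取该表,应用自定义逻辑后复制到另一个系统。

The pattern can be illustrated with a small hypothetical sketch: a trigger logs each change into a separate changes table, and an external process reads that table, applies custom application logic, and replicates to another system.

```python
class TinyDB:
    def __init__(self):
        self.rows = {}
        self.changes = []   # the separate table the trigger writes into
        self.triggers = []

    def write(self, key, value):
        old = self.rows.get(key)
        self.rows[key] = value
        for trigger in self.triggers:
            trigger(key, old, value)  # fired on every write transaction

db = TinyDB()
# "Trigger": custom code executed automatically on each data change,
# here recording the change into the separate changes table.
db.triggers.append(lambda key, old, new: db.changes.append((key, old, new)))

def external_replicator(db, target):
    # The external process reads the changes table, applies arbitrary
    # application logic (here: a custom transformation), and replicates.
    for key, _old, new in db.changes:
        target[key] = new.upper()
```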

基于触发器的复制通常比其他复制方法具有更大的开销,并且比数据库的内置复制更容易出现错误和限制。然而,由于其灵活性,它仍然很有用。

Trigger-based replication typically has greater overheads than other replication methods, and is more prone to bugs and limitations than the database’s built-in replication. However, it can nevertheless be useful due to its flexibility.

复制延迟问题

Problems with Replication Lag

能够容忍节点故障只是需要复制的原因之一。正如第二部分的简介中提到的,其他原因是可扩展性(处理的请求数量超出单台机器可以处理的数量)和延迟(将副本放置在地理位置上更靠近用户)。

Being able to tolerate node failures is just one reason for wanting replication. As mentioned in the introduction to Part II, other reasons are scalability (processing more requests than a single machine can handle) and latency (placing replicas geographically closer to users).

基于领导者的复制要求所有写入都经过单个节点,但只读查询可以访问任何副本。对于主要由读取、只有一小部分写入组成的工作负载(Web 上的常见模式),有一个有吸引力的选择:创建许多追随者,并在这些追随者之间分配读取请求。这减轻了领导者的负载,并允许由附近的副本来处理读取请求。

Leader-based replication requires all writes to go through a single node, but read-only queries can go to any replica. For workloads that consist of mostly reads and only a small percentage of writes (a common pattern on the web), there is an attractive option: create many followers, and distribute the read requests across those followers. This removes load from the leader and allows read requests to be served by nearby replicas.

在这种读取扩展架构中,您只需添加更多追随者,即可提高处理只读请求的能力。然而,这种方法实际上只适用于异步复制:如果您尝试同步复制到所有追随者,单个节点故障或网络中断将使整个系统无法写入。而且节点越多,其中某个节点宕机的可能性就越大,因此完全同步的配置将非常不可靠。

In this read-scaling architecture, you can increase the capacity for serving read-only requests simply by adding more followers. However, this approach only realistically works with asynchronous replication—if you tried to synchronously replicate to all followers, a single node failure or network outage would make the entire system unavailable for writing. And the more nodes you have, the likelier it is that one will be down, so a fully synchronous configuration would be very unreliable.

不幸的是,如果应用程序从异步追随者读取数据,而该追随者已经落后,它就可能看到过时的信息。这会导致数据库中出现明显的不一致:如果您同时在领导者和追随者上运行相同的查询,可能会得到不同的结果,因为并非所有写入都已反映在追随者中。这种不一致只是一种临时状态;如果您停止写入数据库并等待一段时间,追随者最终会赶上并与领导者保持一致。因此,这种效应被称为最终一致性[ 22, 23 ]。iii

Unfortunately, if an application reads from an asynchronous follower, it may see outdated information if the follower has fallen behind. This leads to apparent inconsistencies in the database: if you run the same query on the leader and a follower at the same time, you may get different results, because not all writes have been reflected in the follower. This inconsistency is just a temporary state—if you stop writing to the database and wait a while, the followers will eventually catch up and become consistent with the leader. For that reason, this effect is known as eventual consistency [22, 23].iii

“最终”这个术语故意含糊不清:一般来说,副本可以落后多远是没有限制的。在正常操作中,领导者上发生的写入与追随者上反映的写入之间的延迟(复制延迟)可能只有几分之一秒,并且在实践中并不明显。但是,如果系统接近满负荷运行或网络出现问题,则延迟很容易增加到几秒甚至几分钟。

The term “eventually” is deliberately vague: in general, there is no limit to how far a replica can fall behind. In normal operation, the delay between a write happening on the leader and being reflected on a follower—the replication lag—may be only a fraction of a second, and not noticeable in practice. However, if the system is operating near capacity or if there is a problem in the network, the lag can easily increase to several seconds or even minutes.

当滞后如此之大时,它引入的不一致不仅仅是一个理论问题,而且是一个实际的应用问题。在本节中,我们将重点介绍存在复制滞后时可能出现的问题的三个示例,并概述解决这些问题的一些方法。

When the lag is so large, the inconsistencies it introduces are not just a theoretical issue but a real problem for applications. In this section we will highlight three examples of problems that are likely to occur when there is replication lag and outline some approaches to solving them.

读己之写

Reading Your Own Writes

许多应用程序允许用户提交一些数据,然后查看他们提交的内容。这可能是客户数据库中的一条记录、讨论串上的一条评论,或者其他类似的内容。当新数据被提交时,必须将其发送给领导者;但当用户查看数据时,可以从追随者处读取。如果数据经常被查看但只是偶尔被写入,这种做法尤其合适。

Many applications let the user submit some data and then view what they have submitted. This might be a record in a customer database, or a comment on a discussion thread, or something else of that sort. When new data is submitted, it must be sent to the leader, but when the user views the data, it can be read from a follower. This is especially appropriate if data is frequently viewed but only occasionally written.

对于异步复制,存在一个问题,如图 5-3 所示 :如果用户在写入后不久查看数据,则新数据可能尚未到达副本。对于用户来说,他们提交的数据看起来好像丢失了,所以他们会不高兴,这是可以理解的。

With asynchronous replication, there is a problem, illustrated in Figure 5-3: if the user views the data shortly after making a write, the new data may not yet have reached the replica. To the user, it looks as though the data they submitted was lost, so they will be understandably unhappy.

图 5-3。用户进行了一次写入,然后从过时的副本中读取。为了防止这种异常,我们需要写后读一致性。

在这种情况下,我们需要写后读一致性,也称为读你所写一致性 [ 24 ]。这是一个保证,如果用户重新加载页面,他们将始终看到他们自己提交的任何更新。它不对其他用户做出任何承诺:其他用户的更新可能要到稍后时间才可见。但是,它可以让用户放心,他们自己的输入已正确保存。

In this situation, we need read-after-write consistency, also known as read-your-writes consistency [24]. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves. It makes no promises about other users: other users’ updates may not be visible until some later time. However, it reassures the user that their own input has been saved correctly.

我们如何在基于领导者复制的系统中实现写后读一致性?有多种可能的技术,这里列举几种:

How can we implement read-after-write consistency in a system with leader-based replication? There are various possible techniques. To mention a few:

  • 当读取用户可能修改过的内容时,从领导者处读取;否则,从追随者处读取。这要求您有某种方式知道某些内容是否可能已被修改,而无需实际查询它。例如,社交网络上的用户个人资料信息通常只能由资料的所有者编辑,其他人都不能编辑。因此,一个简单的规则是:始终从领导者读取用户自己的个人资料,而从追随者读取其他用户的个人资料。

  • When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower. This requires that you have some way of knowing whether something might have been modified, without actually querying it. For example, user profile information on a social network is normally only editable by the owner of the profile, not by anybody else. Thus, a simple rule is: always read the user’s own profile from the leader, and any other users’ profiles from a follower.

  • 如果应用程序中的大多数内容都可以由用户编辑,则该方法将不会有效,因为大多数内容都必须从领导者处读取(否定了读取扩展的好处)。在这种情况下,可以使用其他标准来决定是否从领导者处读取。例如,您可以跟踪上次更新的时间,并在上次更新后的一分钟内从领导者处进行所有读取。您还可以监视追随者的复制延迟,并防止对落后领导者一分钟以上的任何追随者进行查询。

  • If most things in the application are potentially editable by the user, that approach won’t be effective, as most things would have to be read from the leader (negating the benefit of read scaling). In that case, other criteria may be used to decide whether to read from the leader. For example, you could track the time of the last update and, for one minute after the last update, make all reads from the leader. You could also monitor the replication lag on followers and prevent queries on any follower that is more than one minute behind the leader.

  • 客户端可以记住其最近写入的时间戳,然后系统可以确保为该用户提供任何读取服务的副本至少反映该时间戳之前的更新。如果副本不够最新,则读取可以由另一个副本处理,或者查询可以等待副本赶上。 时间戳可以是逻辑时间戳(指示写入顺序的东西,例如日志序列号)或实际系统时钟(在这种情况下时钟同步变得至关重要;请参阅“不可靠的时钟”)。

  • The client can remember the timestamp of its most recent write—then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp. If a replica is not sufficiently up to date, either the read can be handled by another replica or the query can wait until the replica has caught up. The timestamp could be a logical timestamp (something that indicates ordering of writes, such as the log sequence number) or the actual system clock (in which case clock synchronization becomes critical; see “Unreliable Clocks”).

  • 如果您的副本分布在多个数据中心(为了在地理上接近用户或为了可用性),则会产生额外的复杂性。任何需要领导者提供服务的请求都必须路由到包含领导者的数据中心。

  • If your replicas are distributed across multiple datacenters (for geographical proximity to users or for availability), there is additional complexity. Any request that needs to be served by the leader must be routed to the datacenter that contains the leader.
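上面“客户端记住最近一次写入的时间戳”这一方案可以用如下假设性草图说明:读取只由复制进度不早于该逻辑时间戳(如日志序列号)的副本来处理。

The "client remembers the timestamp of its most recent write" technique above can be sketched hypothetically: a read is only served by a replica whose replication progress has reached at least that logical timestamp (e.g., a log sequence number).

```python
class Replica:
    def __init__(self, applied_upto, data):
        self.applied_upto = applied_upto  # log sequence number applied so far
        self.data = data

def read_your_writes(replicas, key, min_lsn):
    # Serve the read only from a replica that has caught up to the
    # client's last write; otherwise the caller must wait for a replica
    # to catch up, or fall back to reading from the leader.
    for replica in replicas:
        if replica.applied_upto >= min_lsn:
            return replica.data.get(key)
    raise TimeoutError("no replica caught up to LSN %d" % min_lsn)
```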

当同一用户从多个设备(例如桌面 Web 浏览器和移动应用程序)访问您的服务时,会出现另一个复杂情况。在这种情况下,您可能希望提供跨设备 写入后读取一致性:如果用户在一个设备上输入一些信息,然后在另一台设备上查看它,他们应该看到刚刚输入的信息。

Another complication arises when the same user is accessing your service from multiple devices, for example a desktop web browser and a mobile app. In this case you may want to provide cross-device read-after-write consistency: if the user enters some information on one device and then views it on another device, they should see the information they just entered.

在这种情况下,还需要考虑一些其他问题:

In this case, there are some additional issues to consider:

  • 需要记住用户上次更新的时间戳的方法变得更加困难,因为在一台设备上运行的代码不知道另一台设备上发生了什么更新。该元数据需要集中化。

  • Approaches that require remembering the timestamp of the user’s last update become more difficult, because the code running on one device doesn’t know what updates have happened on the other device. This metadata will need to be centralized.

  • 如果您的副本分布在不同的数据中心,则无法保证来自不同设备的连接将路由到同一数据中心。(例如,如果用户的台式计算机使用家庭宽带连接,而他们的移动设备使用蜂窝数据网络,则设备的网络路由可能完全不同。)如果您的方法需要从领导者那里读取,您可能首先需要路由所有用户设备向同一数据中心发出的请求。

  • If your replicas are distributed across different datacenters, there is no guarantee that connections from different devices will be routed to the same datacenter. (For example, if the user’s desktop computer uses the home broadband connection and their mobile device uses the cellular data network, the devices’ network routes may be completely different.) If your approach requires reading from the leader, you may first need to route requests from all of a user’s devices to the same datacenter.

单调读取

Monotonic Reads

我们的第二个异常示例是,当从异步追随者读取数据时,用户可能会看到事物在时间上向后移动。

Our second example of an anomaly that can occur when reading from asynchronous followers is that it’s possible for a user to see things moving backward in time.

如果用户从不同的副本进行多次读取,则可能会发生这种情况。例如 图5-4显示用户 2345 两次进行相同的查询,第一次是向延迟较小的关注者,然后是向延迟较大的关注者。(如果用户刷新网页,并且每个请求都路由到随机服务器,则很可能出现这种情况。)第一个查询返回用户 1234 最近添加的评论,但第二个查询不返回任何内容,因为落后的追随者尚未接收到该写入。实际上,第二个查询在比第一个查询更早的时间点观察系统。如果第一个查询没有返回任何内容,情况也不会那么糟糕,因为用户 2345 可能不知道用户 1234 最近添加了评论。然而,如果用户 2345 首先看到用户 1234 的评论出现,然后又看到它消失,那么他们会感到非常困惑。

This can happen if a user makes several reads from different replicas. For example, Figure 5-4 shows user 2345 making the same query twice, first to a follower with little lag, then to a follower with greater lag. (This scenario is quite likely if the user refreshes a web page, and each request is routed to a random server.) The first query returns a comment that was recently added by user 1234, but the second query doesn’t return anything because the lagging follower has not yet picked up that write. In effect, the second query is observing the system at an earlier point in time than the first query. This wouldn’t be so bad if the first query hadn’t returned anything, because user 2345 probably wouldn’t know that user 1234 had recently added a comment. However, it’s very confusing for user 2345 if they first see user 1234’s comment appear, and then see it disappear again.

图 5-4。用户首先从新副本读取,然后从陈旧副本读取。时间似乎倒退了。为了防止这种异常情况,我们需要单调读取。

Figure 5-4. A user first reads from a fresh replica, then from a stale replica. Time appears to go backward. To prevent this anomaly, we need monotonic reads.

单调读取[ 23 ]是这种异常不会发生的保证。它的保证比强一致性要弱,但比最终一致性的保证更强。当您读取数据时,您可能会看到旧值;单调读取仅意味着如果一个用户按顺序进行多次读取,他们将不会看到时间倒退,即,他们在先前读取了较新的数据后不会再读取较旧的数据。

Monotonic reads [23] is a guarantee that this kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency. When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go backward—i.e., they will not read older data after having previously read newer data.

实现单调读取的一种方法是确保每个用户始终从同一个副本进行读取(不同的用户可以从不同的副本读取)。例如,可以基于用户 ID 的散列来选择副本,而不是随机选择。但是,如果该副本失败,则用户的查询将需要重新路由到另一个副本。

One way of achieving monotonic reads is to make sure that each user always makes their reads from the same replica (different users can read from different replicas). For example, the replica can be chosen based on a hash of the user ID, rather than randomly. However, if that replica fails, the user’s queries will need to be rerouted to another replica.
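The hash-based replica selection just described can be sketched as follows; this is an illustrative fragment (the replica names and list are assumptions), not any particular database's API. A cryptographic hash is used so that the mapping is deterministic and stable across processes:

```python
import hashlib

def replica_for_user(user_id, replicas):
    # Deterministic hash of the user ID, stable across processes,
    # so the same user's reads always go to the same replica.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

replicas = ["replica-a", "replica-b", "replica-c"]
# The same user always maps to the same replica:
assert replica_for_user(2345, replicas) == replica_for_user(2345, replicas)
assert replica_for_user(2345, replicas) in replicas
```

A real system would also need the failover path mentioned above: if the chosen replica is down, reroute the user to another one (at which point monotonicity may be temporarily lost).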

一致的前缀读取

Consistent Prefix Reads

我们的第三个复制滞后异常示例涉及违反因果关系。想象一下庞斯先生和蛋糕夫人之间的以下简短对话:

Our third example of replication lag anomalies concerns violation of causality. Imagine the following short dialog between Mr. Poons and Mrs. Cake:

庞斯先生
Mr. Poons

蛋糕夫人,你能看到多远的未来?

How far into the future can you see, Mrs. Cake?

蛋糕夫人
Mrs. Cake

通常大约十秒钟,庞斯先生。

About ten seconds usually, Mr. Poons.

这两句话之间存在因果关系:Cake 夫人听到了 Poons 先生的问题并回答了它。

There is a causal dependency between those two sentences: Mrs. Cake heard Mr. Poons’s question and answered it.

现在,想象一下第三个人正在通过关注者收听这段对话。Cake夫人所说的事情经过跟随者时几乎没有滞后,但Poons先生所说的事情有较长的复制滞后(见图5-5)。该观察者会听到以下内容:

Now, imagine a third person is listening to this conversation through followers. The things said by Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer replication lag (see Figure 5-5). This observer would hear the following:

蛋糕夫人
Mrs. Cake

通常大约十秒钟,庞斯先生。

About ten seconds usually, Mr. Poons.

庞斯先生
Mr. Poons

蛋糕夫人,你能看到多远的未来?

How far into the future can you see, Mrs. Cake?

在观察者看来,蛋糕夫人似乎在庞斯先生提出问题之前就已经回答了这个问题。这种通灵能力令人印象深刻,但也非常令人困惑[ 25 ]。

To the observer it looks as though Mrs. Cake is answering the question before Mr. Poons has even asked it. Such psychic powers are impressive, but very confusing [25].

图 5-5。如果某些分区的复制速度比其他分区慢,观察者可能会在看到问题之前就看到答案。

Figure 5-5. If some partitions are replicated slower than others, an observer may see the answer before they see the question.

防止这种异常需要另一种类型的保证:一致的前缀读取 [ 23 ]。此保证表示,如果一系列写入按特定顺序发生,则任何读取这些写入的人都会看到它们以相同的顺序出现。

Preventing this kind of anomaly requires another type of guarantee: consistent prefix reads [23]. This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.

这是分区(分片)数据库中的一个特殊问题,我们将在 第 6 章中讨论。如果数据库始终以相同的顺序应用写入,则读取始终会看到一致的前缀,因此不会发生这种异常。然而,在许多分布式数据库中,不同的分区独立操作,因此不存在全局写入顺序:当用户从数据库读取时,他们可能会看到数据库的某些部分处于较旧的状态,而另一些部分则处于较新的状态。

This is a particular problem in partitioned (sharded) databases, which we will discuss in Chapter 6. If the database always applies writes in the same order, reads always see a consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different partitions operate independently, so there is no global ordering of writes: when a user reads from the database, they may see some parts of the database in an older state and some in a newer state.

一种解决方案是确保彼此因果相关的任何写入都写入同一分区,但在某些应用程序中无法有效完成。还有一些算法可以显式跟踪因果依赖关系,我们将在 ““发生在”之前的关系和并发”中回到这个主题。

One solution is to make sure that any writes that are causally related to each other are written to the same partition—but in some applications that cannot be done efficiently. There are also algorithms that explicitly keep track of causal dependencies, a topic that we will return to in “The “happens-before” relationship and concurrency”.

复制延迟的解决方案

Solutions for Replication Lag

使用最终一致的系统时,值得考虑如果复制延迟增加到几分钟甚至几小时,应用程序的行为方式。如果答案是“没问题”,那就太好了。但是,如果结果是给用户带来不好的体验,那么设计系统以提供更强的保证就很重要,例如先写后读。假装复制是同步的,而实际上是异步的,这会导致后续出现问题。

When working with an eventually consistent system, it is worth thinking about how the application behaves if the replication lag increases to several minutes or even hours. If the answer is “no problem,” that’s great. However, if the result is a bad experience for users, it’s important to design the system to provide a stronger guarantee, such as read-after-write. Pretending that replication is synchronous when in fact it is asynchronous is a recipe for problems down the line.

如前所述,应用程序可以通过多种方式提供比底层数据库更强的保证,例如,通过在领导者上执行某些类型的读取。然而,在应用程序代码中处理这些问题很复杂并且很容易出错。

As discussed earlier, there are ways in which an application can provide a stronger guarantee than the underlying database—for example, by performing certain kinds of reads on the leader. However, dealing with these issues in application code is complex and easy to get wrong.

如果应用程序开发人员不必担心微妙的复制问题,并且可以相信他们的数据库“做正确的事情”,那就更好了。这就是事务存在的原因:它们是数据库提供更强保证的一种方式,以便应用程序可以更简单。

It would be better if application developers didn’t have to worry about subtle replication issues and could just trust their databases to “do the right thing.” This is why transactions exist: they are a way for a database to provide stronger guarantees so that the application can be simpler.

单节点事务已经存在很长时间了。然而,在转向分布式(复制和分区)数据库的过程中,许多系统已经放弃了它们,声称事务在性能和可用性方面过于昂贵,并断言最终一致性在可扩展系统中是不可避免的。这种说法有一定道理,但过于简单化,我们将在本书的其余部分中形成更细致的观点。我们将在第 7 章和第 9 章中回到事务的主题,并在第三部分中讨论一些替代机制。

Single-node transactions have existed for a long time. However, in the move to distributed (replicated and partitioned) databases, many systems have abandoned them, claiming that transactions are too expensive in terms of performance and availability, and asserting that eventual consistency is inevitable in a scalable system. There is some truth in that statement, but it is overly simplistic, and we will develop a more nuanced view over the course of the rest of this book. We will return to the topic of transactions in Chapters 7 and 9, and we will discuss some alternative mechanisms in Part III.

多领导者复制

Multi-Leader Replication

到目前为止,在本章中我们只考虑了使用单个领导者的复制架构。尽管这是一种常见的方法,但还有一些有趣的替代方法。

So far in this chapter we have only considered replication architectures using a single leader. Although that is a common approach, there are interesting alternatives.

基于领导者的复制有一个主要缺点:只有一个领导者,所有写入都必须经过它。iv如果您因任何原因无法连接到领导者,例如由于您和领导者之间的网络中断,您将无法写入数据库。

Leader-based replication has one major downside: there is only one leader, and all writes must go through it.iv If you can’t connect to the leader for any reason, for example due to a network interruption between you and the leader, you can’t write to the database.

基于领导者的复制模型的一种自然扩展是允许多个节点接受写入。复制仍然以相同的方式进行:处理写入的每个节点必须将该数据更改转发到所有其他节点。我们称之为多领导者配置(也称为 主-主主动/主动复制)。在这种设置中,每个领导者同时充当其他领导者的追随者。

A natural extension of the leader-based replication model is to allow more than one node to accept writes. Replication still happens in the same way: each node that processes a write must forward that data change to all the other nodes. We call this a multi-leader configuration (also known as master–master or active/active replication). In this setup, each leader simultaneously acts as a follower to the other leaders.

多领导者复制的用例

Use Cases for Multi-Leader Replication

在单个数据中心内使用多领导者设置几乎没有意义,因为好处很少超过增加的复杂性。然而,在某些情况下这种配置是合理的。

It rarely makes sense to use a multi-leader setup within a single datacenter, because the benefits rarely outweigh the added complexity. However, there are some situations in which this configuration is reasonable.

多数据中心运行

Multi-datacenter operation

想象一下,您有一个数据库,其副本位于多个不同的数据中心(也许是为了您可以容忍整个数据中心的故障,或者也许是为了更接近您的用户)。使用正常的基于领导者的复制设置,领导者必须位于其中一个数据中心,并且所有写入都必须通过该数据中心。

Imagine you have a database with replicas in several different datacenters (perhaps so that you can tolerate failure of an entire datacenter, or perhaps in order to be closer to your users). With a normal leader-based replication setup, the leader has to be in one of the datacenters, and all writes must go through that datacenter.

在多领导者配置中,每个数据中心都可以有一个领导者。 图 5-6显示了该架构的外观。在每个数据中心内,使用常规的领导者-跟随者复制;在数据中心之间,每个数据中心的领导者将其更改复制到其他数据中心的领导者。

In a multi-leader configuration, you can have a leader in each datacenter. Figure 5-6 shows what this architecture might look like. Within each datacenter, regular leader–follower replication is used; between datacenters, each datacenter’s leader replicates its changes to the leaders in other datacenters.

图 5-6。跨多个数据中心的多主复制。

Figure 5-6. Multi-leader replication across multiple datacenters.

让我们比较一下单领导者和多领导者配置在多数据中心部署中的表现:

Let’s compare how the single-leader and multi-leader configurations fare in a multi-datacenter deployment:

性能
Performance

在单领导者配置中,每次写入都必须通过互联网传输到领导者的数据中心。这可能会显着增加写入延迟,并且可能违背最初拥有多个数据中心的目的。在多领导者配置中,每次写入都可以在本地数据中心处理,并异步复制到其他数据中心。因此,数据中心间的网络延迟对用户来说是隐藏的,这意味着感知的性能可能会更好。

In a single-leader configuration, every write must go over the internet to the datacenter with the leader. This can add significant latency to writes and might contravene the purpose of having multiple datacenters in the first place. In a multi-leader configuration, every write can be processed in the local datacenter and is replicated asynchronously to the other datacenters. Thus, the inter-datacenter network delay is hidden from users, which means the perceived performance may be better.

数据中心中断的容忍度
Tolerance of datacenter outages

在单领导者配置中,如果领导者所在的数据中心发生故障,故障转移可以将另一个数据中心中的追随者提升为领导者。在多领导者配置中,每个数据中心都可以独立于其他数据中心继续运行,并且当发生故障的数据中心恢复在线时复制会赶上。

In a single-leader configuration, if the datacenter with the leader fails, failover can promote a follower in another datacenter to be leader. In a multi-leader configuration, each datacenter can continue operating independently of the others, and replication catches up when the failed datacenter comes back online.

网络问题的容忍度
Tolerance of network problems

数据中心之间的流量通常通过公共互联网进行, 这可能不如数据中心内的本地网络可靠。单领导者配置对此数据中心间链接中的问题非常敏感,因为写入是通过此链接同步进行的。具有异步复制的多领导者配置通常可以更好地容忍网络问题:临时网络中断不会阻止写入处理。

Traffic between datacenters usually goes over the public internet, which may be less reliable than the local network within a datacenter. A single-leader configuration is very sensitive to problems in this inter-datacenter link, because writes are made synchronously over this link. A multi-leader configuration with asynchronous replication can usually tolerate network problems better: a temporary network interruption does not prevent writes being processed.

一些数据库默认支持多领导者配置,但也经常使用外部工具来实现,例如 MySQL 的 Tungsten Replicator [ 26 ]、PostgreSQL 的 BDR [ 27 ] 和 Oracle 的 GoldenGate [ 19 ]。

Some databases support multi-leader configurations by default, but it is also often implemented with external tools, such as Tungsten Replicator for MySQL [26], BDR for PostgreSQL [27], and GoldenGate for Oracle [19].

虽然多主复制有优点,但它也有一个很大的缺点:相同的数据可能在两个不同的数据中心同时被修改,而这些写入冲突必须得到解决(在图 5-6 中表示为“冲突解决”)。我们将在“处理写入冲突”中讨论这个问题。

Although multi-leader replication has advantages, it also has a big downside: the same data may be concurrently modified in two different datacenters, and those write conflicts must be resolved (indicated as “conflict resolution” in Figure 5-6). We will discuss this issue in “Handling Write Conflicts”.

由于多主复制在许多数据库中是后来才加装的功能,因此通常存在微妙的配置陷阱,以及与其他数据库功能之间令人意外的交互。例如,自增键、触发器和完整性约束都可能出现问题。因此,多领导者复制通常被认为是危险领域,应尽可能避免[ 28 ]。

As multi-leader replication is a somewhat retrofitted feature in many databases, there are often subtle configuration pitfalls and surprising interactions with other database features. For example, autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason, multi-leader replication is often considered dangerous territory that should be avoided if possible [28].

支持离线操作的客户端

Clients with offline operation

适合多领导者复制的另一种情况是,如果您有一个应用程序需要在与 Internet 断开连接时继续工作。

Another situation in which multi-leader replication is appropriate is if you have an application that needs to continue to work while it is disconnected from the internet.

例如,考虑手机、笔记本电脑和其他设备上的日历应用程序。您需要能够随时查看会议(发出读取请求)并进入新会议(发出写入请求),无论您的设备当前是否有互联网连接。如果您在离线时进行任何更改,则需要在设备下次上线时将它们与服务器和其他设备同步。

For example, consider the calendar apps on your mobile phone, your laptop, and other devices. You need to be able to see your meetings (make read requests) and enter new meetings (make write requests) at any time, regardless of whether your device currently has an internet connection. If you make any changes while you are offline, they need to be synced with a server and your other devices when the device is next online.

在这种情况下,每个设备都有一个充当领导者的本地数据库(它接受写入请求),并且所有设备上的日历副本之间存在异步多领导者复制过程(同步)。复制延迟可能是几个小时甚至几天,具体取决于您何时可以访问互联网。

In this case, every device has a local database that acts as a leader (it accepts write requests), and there is an asynchronous multi-leader replication process (sync) between the replicas of your calendar on all of your devices. The replication lag may be hours or even days, depending on when you have internet access available.

从架构的角度来看,这种设置本质上与数据中心之间的多主复制相同,但走向极端:每个设备都是一个“数据中心”,它们之间的网络连接极其不可靠。正如日历同步实施失败的丰富历史所表明的那样,多领导者复制是一件很难做到正确的事情。

From an architectural point of view, this setup is essentially the same as multi-leader replication between datacenters, taken to the extreme: each device is a “datacenter,” and the network connection between them is extremely unreliable. As the rich history of broken calendar sync implementations demonstrates, multi-leader replication is a tricky thing to get right.

有一些工具旨在使这种多领导者配置变得更容易。例如,CouchDB 就是为这种操作模式而设计的[ 29 ]。

There are tools that aim to make this kind of multi-leader configuration easier. For example, CouchDB is designed for this mode of operation [29].

协同编辑

Collaborative editing

实时协作编辑应用程序允许多人同时编辑文档。例如,Etherpad [ 30 ]和Google Docs [ 31 ]允许多人同时编辑文本文档或电子表格(该算法在“自动冲突解决”中简要讨论)。

Real-time collaborative editing applications allow several people to edit a document simultaneously. For example, Etherpad [30] and Google Docs [31] allow multiple people to concurrently edit a text document or spreadsheet (the algorithm is briefly discussed in “Automatic Conflict Resolution”).

我们通常不会将协作编辑视为数据库复制问题,但它与前面提到的离线编辑用例有很多共同点。当一个用户编辑文档时,更改会立即应用到其本地副本(Web 浏览器或客户端应用程序中的文档状态),并异步复制到服务器和正在编辑同一文档的任何其他用户。

We don’t usually think of collaborative editing as a database replication problem, but it has a lot in common with the previously mentioned offline editing use case. When one user edits a document, the changes are instantly applied to their local replica (the state of the document in their web browser or client application) and asynchronously replicated to the server and any other users who are editing the same document.

如果要保证不会出现编辑冲突,应用程序必须在用户编辑文档之前先获得该文档的锁。如果另一个用户想要编辑同一文档,他们必须先等待第一个用户提交更改并释放锁。这种协作模型相当于在领导者上使用事务的单领导者复制。

If you want to guarantee that there will be no editing conflicts, the application must obtain a lock on the document before a user can edit it. If another user wants to edit the same document, they first have to wait until the first user has committed their changes and released the lock. This collaboration model is equivalent to single-leader replication with transactions on the leader.

然而,为了更快地协作,您可能希望使更改单位非常小(例如,单次击键)并避免锁定。这种方法允许多个用户同时编辑,但它也带来了多领导者复制的所有挑战,包括需要解决冲突[ 32 ]。

However, for faster collaboration, you may want to make the unit of change very small (e.g., a single keystroke) and avoid locking. This approach allows multiple users to edit simultaneously, but it also brings all the challenges of multi-leader replication, including requiring conflict resolution [32].

处理写冲突

Handling Write Conflicts

多主复制最大的问题是可能会发生写冲突,这意味着需要解决冲突。

The biggest problem with multi-leader replication is that write conflicts can occur, which means that conflict resolution is required.

例如,考虑一个由两个用户同时编辑的 wiki 页面,如图 5-7所示。用户1将页面标题从A更改为B,用户2同时将页面标题从A更改为C。每个用户的更改都会成功应用于其本地领导者。然而,当异步复制更改时,会检测到冲突[ 33 ]。在单领导者数据库中不会出现此问题。

For example, consider a wiki page that is simultaneously being edited by two users, as shown in Figure 5-7. User 1 changes the title of the page from A to B, and user 2 changes the title from A to C at the same time. Each user’s change is successfully applied to their local leader. However, when the changes are asynchronously replicated, a conflict is detected [33]. This problem does not occur in a single-leader database.

图 5-7。由于两个领导者同时更新同一条记录而导致的写入冲突。

Figure 5-7. A write conflict caused by two leaders concurrently updating the same record.

同步与异步冲突检测

Synchronous versus asynchronous conflict detection

在单领导者数据库中,第二个写入器将阻塞并等待第一个写入完成,或者中止第二个写入事务,迫使用户重试写入。另一方面,在多领导者设置中,两次写入都会成功,并且仅在稍后的某个时间点异步检测到冲突。到那时再要求用户解决冲突可能就为时已晚了。

In a single-leader database, the second writer will either block and wait for the first write to complete, or abort the second write transaction, forcing the user to retry the write. On the other hand, in a multi-leader setup, both writes are successful, and the conflict is only detected asynchronously at some later point in time. At that time, it may be too late to ask the user to resolve the conflict.

原则上,您可以使冲突检测同步,即等待写入复制到所有副本,然后再告诉用户写入成功。但是,这样做会失去多领导者复制的主要优势:允许每个副本独立接受写入。如果您想要同步冲突检测,您不妨使用单领导者复制。

In principle, you could make the conflict detection synchronous—i.e., wait for the write to be replicated to all replicas before telling the user that the write was successful. However, by doing so, you would lose the main advantage of multi-leader replication: allowing each replica to accept writes independently. If you want synchronous conflict detection, you might as well just use single-leader replication.

避免冲突

Conflict avoidance

处理冲突的最简单策略是避免冲突:如果应用程序可以确保特定记录的所有写入都经过同一个领导者,则不会发生冲突。由于多领导者复制的许多实现处理冲突的能力相当差,因此避免冲突是经常推荐的方法[ 34 ]。

The simplest strategy for dealing with conflicts is to avoid them: if the application can ensure that all writes for a particular record go through the same leader, then conflicts cannot occur. Since many implementations of multi-leader replication handle conflicts quite poorly, avoiding conflicts is a frequently recommended approach [34].

例如,在用户可以编辑自己数据的应用程序中,您可以确保来自特定用户的请求始终路由到同一个数据中心,并使用该数据中心中的领导者进行读写。不同的用户可能有不同的“主”数据中心(可能是根据与用户的地理位置接近程度来选择的),但从任何一个用户的角度来看,配置本质上都是单领导者。

For example, in an application where a user can edit their own data, you can ensure that requests from a particular user are always routed to the same datacenter and use the leader in that datacenter for reading and writing. Different users may have different “home” datacenters (perhaps picked based on geographic proximity to the user), but from any one user’s point of view the configuration is essentially single-leader.

但是,有时您可能想要更改记录的指定领导者 - 可能是因为一个数据中心发生故障,您需要将流量重新路由到另一个数据中心,或者可能是因为用户已移动到不同的位置并且现在更接近另一个数据中心。在这种情况下,冲突避免就失效了,你必须处理不同领导者并发写入的可能性。

However, sometimes you might want to change the designated leader for a record—perhaps because one datacenter has failed and you need to reroute traffic to another datacenter, or perhaps because a user has moved to a different location and is now closer to a different datacenter. In this situation, conflict avoidance breaks down, and you have to deal with the possibility of concurrent writes on different leaders.

趋向一致状态

Converging toward a consistent state

单领导者数据库按顺序应用写入:如果同一字段有多次更新,则最后一次写入决定该字段的最终值。

A single-leader database applies writes in a sequential order: if there are several updates to the same field, the last write determines the final value of the field.

在多领导者配置中,没有定义的写入顺序,因此不清楚最终值应该是什么。在图5-7中,在leader 1处,标题首先更新为B,然后更新为C;在领导者 2 处,它首先更新为 C,然后更新为 B。这两个顺序都不比另一个“更正确”。

In a multi-leader configuration, there is no defined ordering of writes, so it’s not clear what the final value should be. In Figure 5-7, at leader 1 the title is first updated to B and then to C; at leader 2 it is first updated to C and then to B. Neither order is “more correct” than the other.

如果每个副本只是按照其看到写入的顺序应用写入,则数据库最终将处于不一致的状态:最终值将在领导者 1 处为 C,在领导者 2 处为 B。这是不可接受的 - 每个复制方案都必须确保所有副本中的数据最终都是相同的。因此,数据库必须以收敛的方式解决冲突,这意味着在复制所有更改后,所有副本必须达到相同的最终值。

If each replica simply applied writes in the order that it saw the writes, the database would end up in an inconsistent state: the final value would be C at leader 1 and B at leader 2. That is not acceptable—every replication scheme must ensure that the data is eventually the same in all replicas. Thus, the database must resolve the conflict in a convergent way, which means that all replicas must arrive at the same final value when all changes have been replicated.

有多种方法可以实现聚合冲突解决:

There are various ways of achieving convergent conflict resolution:

  • 给每个写入一个唯一的ID(例如,时间戳、长随机数、UUID或键和值的散列),选择具有最高ID的写入作为获胜者,并丢弃其他写入。如果使用时间戳,则此技术称为最后写入获胜(LWW)。尽管这种方法很流行,但它很容易导致数据丢失[ 35 ]。我们将在本章末尾(“检测并发写入”)​​更详细地讨论 LWW。

  • Give each write a unique ID (e.g., a timestamp, a long random number, a UUID, or a hash of the key and value), pick the write with the highest ID as the winner, and throw away the other writes. If a timestamp is used, this technique is known as last write wins (LWW). Although this approach is popular, it is dangerously prone to data loss [35]. We will discuss LWW in more detail at the end of this chapter (“Detecting Concurrent Writes”).

  • 为每个副本提供唯一的 ID,并让源自编号较高的副本的写入始终优先于源自编号较低的副本的写入。这种方法也意味着数据丢失。

  • Give each replica a unique ID, and let writes that originated at a higher-numbered replica always take precedence over writes that originated at a lower-numbered replica. This approach also implies data loss.

  • 以某种方式将这些值合并在一起,例如,按字母顺序对它们进行排序,然后将它们连接起来(在 图 5-7中,合并后的标题可能类似于“B/C”)。

  • Somehow merge the values together—e.g., order them alphabetically and then concatenate them (in Figure 5-7, the merged title might be something like “B/C”).

  • 将冲突记录在保留所有信息的显式数据结构中,并编写稍后解决冲突的应用程序代码(可能通过提示用户)。

  • Record the conflict in an explicit data structure that preserves all information, and write application code that resolves the conflict at some later time (perhaps by prompting the user).
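As a minimal sketch of the first option, a last-write-wins merge can be made convergent by comparing a (timestamp, unique ID) pair, so that every replica picks the same winner regardless of the order in which it receives the writes. The tuple representation here is an assumption for illustration, not any database's actual format:

```python
def lww_merge(a, b):
    """Pick the version with the highest (timestamp, unique_id) pair.
    Versions are (timestamp, unique_id, value) tuples. The losing write
    is silently discarded, which is why LWW is prone to data loss."""
    return a if (a[0], a[1]) >= (b[0], b[1]) else b

v1 = (1700000000.0, "replica-1", "B")
v2 = (1700000000.0, "replica-2", "C")
# Both replicas converge on the same winner, whatever the arrival order
# (the unique ID breaks the timestamp tie deterministically):
assert lww_merge(v1, v2) == lww_merge(v2, v1) == v2
```

Note that the sketch makes the data-loss problem concrete: the value "B" is gone once the merge runs, even though the two writes were concurrent.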

自定义冲突解决逻辑

Custom conflict resolution logic

由于解决冲突的最合适方法可能取决于应用程序,因此大多数多领导者复制工具允许您使用应用程序代码编写冲突解决逻辑。该代码可以在写入或读取时执行:

As the most appropriate way of resolving a conflict may depend on the application, most multi-leader replication tools let you write conflict resolution logic using application code. That code may be executed on write or on read:

写入时
On write

一旦数据库系统检测到复制更改日志中存在冲突,它就会调用冲突处理程序。例如,Bucardo 允许您为此目的编写 Perl 片段。该处理程序通常无法提示用户,它在后台进程中运行,并且必须快速执行。

As soon as the database system detects a conflict in the log of replicated changes, it calls the conflict handler. For example, Bucardo allows you to write a snippet of Perl for this purpose. This handler typically cannot prompt a user—it runs in a background process and it must execute quickly.

读取时
On read

当检测到冲突时,所有冲突的写入都会被存储。下次读取数据时,这些多个版本的数据将返回给应用程序。应用程序可能会提示用户或自动解决冲突,并将结果写回数据库。例如,CouchDB 就是这样工作的。

When a conflict is detected, all the conflicting writes are stored. The next time the data is read, these multiple versions of the data are returned to the application. The application may prompt the user or automatically resolve the conflict, and write the result back to the database. CouchDB works this way, for example.
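A sketch of the on-read style (loosely modeled on the behavior described above, but with an invented in-memory API, not CouchDB's actual interface): conflicting writes are kept as sibling versions, and the next read hands all of them to an application-supplied resolver, writing the merged result back:

```python
class MultiVersionStore:
    """Toy store that keeps conflicting writes as sibling versions."""
    def __init__(self):
        self.siblings = {}  # key -> list of conflicting values

    def write_conflict(self, key, value):
        # In a real system the replication layer would detect the conflict;
        # here we just append another sibling version.
        self.siblings.setdefault(key, []).append(value)

    def read(self, key, resolve):
        versions = self.siblings.get(key, [])
        if len(versions) > 1:
            merged = resolve(versions)     # application-level resolution
            self.siblings[key] = [merged]  # write the result back
        return self.siblings[key][0]

store = MultiVersionStore()
store.write_conflict("title", "B")
store.write_conflict("title", "C")
# Resolve by alphabetical merge, as suggested earlier in the chapter:
assert store.read("title", lambda vs: "/".join(sorted(vs))) == "B/C"
```

After the first read, the merged value replaces the siblings, so subsequent reads see a single version.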

请注意,冲突解决通常适用于单个行或文档的级别,而不适用于整个事务 [ 36 ]。因此,如果您有一个事务以原子方式进行多个不同的写入(请参阅 第 7 章),则出于解决冲突的目的,仍会单独考虑每个写入。

Note that conflict resolution usually applies at the level of an individual row or document, not for an entire transaction [36]. Thus, if you have a transaction that atomically makes several different writes (see Chapter 7), each write is still considered separately for the purposes of conflict resolution.

什么是冲突?

What is a conflict?

有些冲突是显而易见的。在图 5-7的示例中,两次写入同时修改了同一记录中的同一字段,将其设置为两个不同的值。毫无疑问,这是一场冲突。

Some kinds of conflict are obvious. In the example in Figure 5-7, two writes concurrently modified the same field in the same record, setting it to two different values. There is little doubt that this is a conflict.

其他类型的冲突可能更难以察觉。例如,考虑一个会议室预订系统:它跟踪哪组人在何时预订了哪个房间。该应用程序需要确保每个房间在同一时间仅由一组人预订(即同一房间不得有任何重叠预订)。在这种情况下,如果同时为同一房间创建两个不同的预订,则可能会出现冲突。即使应用程序在允许用户进行预订之前检查可用性,如果两个预订是针对两个不同的领导者进行的,则可能会发生冲突。

Other kinds of conflict can be more subtle to detect. For example, consider a meeting room booking system: it tracks which room is booked by which group of people at which time. This application needs to ensure that each room is only booked by one group of people at any one time (i.e., there must not be any overlapping bookings for the same room). In this case, a conflict may arise if two different bookings are created for the same room at the same time. Even if the application checks availability before allowing a user to make a booking, there can be a conflict if the two bookings are made on two different leaders.

没有现成的快速答案,但在接下来的章节中,我们将逐步建立对这个问题的深入理解。我们将在第 7 章中看到更多冲突示例,并在第 12 章中讨论用于检测和解决复制系统中冲突的可扩展方法。

There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a good understanding of this problem. We will see some more examples of conflicts in Chapter 7, and in Chapter 12 we will discuss scalable approaches for detecting and resolving conflicts in a replicated system.

多领导者复制拓扑

Multi-Leader Replication Topologies

复制拓扑 描述了写入从一个节点传播到另一个节点所沿的通信路径。如果您有两个领导者,如图5-7所示,则只有一种合理的拓扑:领导者 1 必须将其所有写入发送到领导者 2,反之亦然。如果有两个以上的领导者,则可以实现各种不同的拓扑。一些示例如图 5-8所示 。

A replication topology describes the communication paths along which writes are propagated from one node to another. If you have two leaders, like in Figure 5-7, there is only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With more than two leaders, various different topologies are possible. Some examples are illustrated in Figure 5-8.

图 5-8。可以设置多领导者复制的三个示例拓扑。

Figure 5-8. Three example topologies in which multi-leader replication can be set up.

最通用的拓扑是全对全图 5-8 [c]),其中每个领导者将其写入发送给其他每个领导者。然而,也使用了更受限制的拓扑:例如,MySQL 默认情况下仅支持循环拓扑 [ 34 ],其中每个节点接收来自一个节点的写入并将这些写入(加上其自己的任何写入)转发到另一个节点。另一种流行的拓扑具有星形形状 :v一个指定的根节点将写入转发到所有其他节点。星形拓扑可以推广为树形。

The most general topology is all-to-all (Figure 5-8 [c]), in which every leader sends its writes to every other leader. However, more restricted topologies are also used: for example, MySQL by default supports only a circular topology [34], in which each node receives writes from one node and forwards those writes (plus any writes of its own) to one other node. Another popular topology has the shape of a star:v one designated root node forwards writes to all of the other nodes. The star topology can be generalized to a tree.

在圆形和星形拓扑中,写入可能需要经过多个节点才能到达所有副本。因此,节点需要转发从其他节点接收到的数据更改。为了防止无限复制循环,每个节点都被赋予一个唯一的标识符,并且在复制日志中,每次写入都用它所经过的所有节点的标识符进行标记[ 43 ]。当节点接收到用其自己的标识符标记的数据更改时,该数据更改将被忽略,因为该节点知道它已经被处理。

In circular and star topologies, a write may need to pass through several nodes before it reaches all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To prevent infinite replication loops, each node is given a unique identifier, and in the replication log, each write is tagged with the identifiers of all the nodes it has passed through [43]. When a node receives a data change that is tagged with its own identifier, that data change is ignored, because the node knows that it has already been processed.
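The loop-prevention scheme just described can be sketched with a toy ring of nodes (the classes and message format are invented for illustration): each write carries the set of node identifiers it has passed through, and a node that sees its own identifier drops the write instead of forwarding it again:

```python
class ReplicaNode:
    """Toy node in a circular replication topology (sketch only)."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.next = None   # the next node in the ring
        self.applied = []  # writes applied locally

    def receive(self, write):
        if self.node_id in write["seen"]:
            return  # write has come back around the ring: ignore it
        write["seen"].add(self.node_id)   # tag the write with our ID
        self.applied.append(write["value"])
        self.next.receive(write)          # forward around the ring

a, b, c = ReplicaNode("a"), ReplicaNode("b"), ReplicaNode("c")
a.next, b.next, c.next = b, c, a  # wire up the ring: a -> b -> c -> a
a.receive({"seen": set(), "value": "x"})
# Each node applied the write exactly once; the loop terminated at `a`:
assert a.applied == b.applied == c.applied == ["x"]
```

In the star and tree topologies the same tagging works; the difference is only in which peers each node forwards to.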

圆形和星形拓扑的一个问题是,如果只有一个节点发生故障,它可能会中断其他节点之间的复制消息流,导致它们无法通信,直到该节点修复为止。可以重新配置拓扑以解决故障节点,但在大多数部署中,此类重新配置必须手动完成。连接更密集的拓扑(例如全对全)的容错能力更好,因为它允许消息沿着不同的路径传输,避免单点故障。

A problem with circular and star topologies is that if just one node fails, it can interrupt the flow of replication messages between other nodes, causing them to be unable to communicate until the node is fixed. The topology could be reconfigured to work around the failed node, but in most deployments such reconfiguration would have to be done manually. The fault tolerance of a more densely connected topology (such as all-to-all) is better because it allows messages to travel along different paths, avoiding a single point of failure.

另一方面,全对全拓扑也可能存在问题。特别是,某些网络链路可能比其他网络链路更快(例如,由于网络拥塞),导致某些复制消息可能“超过”其他消息,如图 5-9所示

On the other hand, all-to-all topologies can have issues too. In particular, some network links may be faster than others (e.g., due to network congestion), with the result that some replication messages may “overtake” others, as illustrated in Figure 5-9.

图 5-9。对于多领导者复制,写入可能会以错误的顺序到达某些副本。

Figure 5-9. With multi-leader replication, writes may arrive in the wrong order at some replicas.

图 5-9中,客户端 A 在领导者 1 上的表中插入一行,客户端 B 在领导者 3 上更新该行。但是,领导者 2 可能会以不同的顺序接收写入:它可能首先接收更新(其中,从它的角度来看,是对数据库中不存在的行的更新)并且仅在稍后接收相应的插入(应该在更新之前)。

In Figure 5-9, client A inserts a row into a table on leader 1, and client B updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may first receive the update (which, from its point of view, is an update to a row that does not exist in the database) and only later receive the corresponding insert (which should have preceded the update).

这是一个因果关系的问题,类似于我们在“一致前缀读取”中看到的问题:更新取决于之前的插入,因此我们需要确保所有节点都先处理插入,然后再处理更新。简单地为每次写入附加时间戳是不够的,因为不能相信时钟足够同步以正确排序领导者 2 上的这些事件(请参阅第 8 章)。

This is a problem of causality, similar to the one we saw in “Consistent Prefix Reads”: the update depends on the prior insert, so we need to make sure that all nodes process the insert first, and then the update. Simply attaching a timestamp to every write is not sufficient, because clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see Chapter 8).

为了正确排序这些事件,可以使用一种称为版本向量的 技术,我们将在本章后面讨论该技术(请参阅“检测并发写入”)​​。然而,冲突检测技术在许多多领导者复制系统中实施得很差。例如,在撰写本文时,PostgreSQL BDR 不提供写入的因果顺序 [ 27 ],并且 Tungsten Replicator for MySQL 甚至不尝试检测冲突 [ 34 ]。

To order these events correctly, a technique called version vectors can be used, which we will discuss later in this chapter (see “Detecting Concurrent Writes”). However, conflict detection techniques are poorly implemented in many multi-leader replication systems. For example, at the time of writing, PostgreSQL BDR does not provide causal ordering of writes [27], and Tungsten Replicator for MySQL doesn’t even try to detect conflicts [34].

如果您使用的是具有多领导者复制的系统,则值得注意这些问题,仔细阅读文档并彻底测试您的数据库,以确保它确实提供您认为具有的保证。

If you are using a system with multi-leader replication, it is worth being aware of these issues, carefully reading the documentation, and thoroughly testing your database to ensure that it really does provide the guarantees you believe it to have.

无领导者复制

Leaderless Replication

到目前为止,我们在本章中讨论的复制方法(单领导者和多领导者复制)基于这样的想法:客户端向一个节点(领导者)发送写入请求,数据库系统负责复制该请求写入其他副本。领导者决定处理写入的顺序,追随者以相同的顺序应用领导者的写入。

The replication approaches we have discussed so far in this chapter—single-leader and multi-leader replication—are based on the idea that a client sends a write request to one node (the leader), and the database system takes care of copying that write to the other replicas. A leader determines the order in which writes should be processed, and followers apply the leader’s writes in the same order.

一些数据存储系统采用不同的方法,放弃领导者的概念,并允许任何副本直接接受来自客户端的写入。一些最早的复制数据系统是无领导的 [ 1 , 44 ],但在关系数据库占主导地位的时代,这个想法基本上被遗忘了。在亚马逊将其用于其内部Dynamo系统后,它再次成为一种时尚的数据库架构[ 37 ]。vi Riak、Cassandra 和 Voldemort 都是受 Dynamo 启发而采用无领导者复制模型的开源数据存储,因此这种数据库也称为Dynamo 风格

Some data storage systems take a different approach, abandoning the concept of a leader and allowing any replica to directly accept writes from clients. Some of the earliest replicated data systems were leaderless [1, 44], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became a fashionable architecture for databases after Amazon used it for its in-house Dynamo system [37].vi Riak, Cassandra, and Voldemort are open source datastores with leaderless replication models inspired by Dynamo, so this kind of database is also known as Dynamo-style.

在一些无领导者实现中,客户端直接将其写入发送到多个副本,而在其他实现中,协调器节点代表客户端执行此操作。然而,与领导者数据库不同,该协调器不强制执行特定的写入顺序。正如我们将看到的,这种设计上的差异对数据库的使用方式有着深远的影响。

In some leaderless implementations, the client directly sends its writes to several replicas, while in others, a coordinator node does this on behalf of the client. However, unlike a leader database, that coordinator does not enforce a particular ordering of writes. As we shall see, this difference in design has profound consequences for the way the database is used.

当节点关闭时写入数据库

Writing to the Database When a Node Is Down

假设您有一个包含三个副本的数据库,其中一个副本当前不可用 — 也许正在重新启动以安装系统更新。在基于领导者的配置中,如果您想继续处理写入,您可能需要执行故障转移(请参阅 “处理节点中断”)。

Imagine you have a database with three replicas, and one of the replicas is currently unavailable—perhaps it is being rebooted to install a system update. In a leader-based configuration, if you want to continue processing writes, you may need to perform a failover (see “Handling Node Outages”).

另一方面,在无领导者配置中,不存在故障转移。 图 5-10显示了所发生的情况:客户端(用户 1234)将写入并行发送到所有三个副本,两个可用副本接受该写入,但不可用副本错过了该写入。假设三个副本中有两个确认写入就足够了:在用户 1234 收到两个ok响应后,我们认为写入成功。客户端只是忽略了其中一个副本错过写入的事实。

On the other hand, in a leaderless configuration, failover does not exist. Figure 5-10 shows what happens: the client (user 1234) sends the write to all three replicas in parallel, and the two available replicas accept the write but the unavailable replica misses it. Let’s say that it’s sufficient for two out of three replicas to acknowledge the write: after user 1234 has received two ok responses, we consider the write to be successful. The client simply ignores the fact that one of the replicas missed the write.

图 5-10。节点中断后的仲裁写入、仲裁读取和读取修复。

Figure 5-10. A quorum write, quorum read, and read repair after a node outage.

现在想象一下,不可用的节点重新上线,并且客户端开始从中读取数据。该节点将丢失节点关闭时发生的任何写入操作。因此,如果您从该节点读取数据,您可能会得到陈旧(过时)的值作为响应。

Now imagine that the unavailable node comes back online, and clients start reading from it. Any writes that happened while the node was down are missing from that node. Thus, if you read from that node, you may get stale (outdated) values as responses.

为了解决这个问题,当客户端从数据库读取数据时,它不仅仅将请求发送到一个副本:读取请求也会并行发送到多个节点。客户端可能会从不同的节点得到不同的响应;即,来自一个节点的最新值和来自另一节点的陈旧值。版本号用于确定哪个值较新(请参阅 “检测并发写入”)​​。

To solve that problem, when a client reads from the database, it doesn’t just send its request to one replica: read requests are also sent to several nodes in parallel. The client may get different responses from different nodes; i.e., the up-to-date value from one node and a stale value from another. Version numbers are used to determine which value is newer (see “Detecting Concurrent Writes”).

读修复和反熵

Read repair and anti-entropy

复制方案应确保最终所有数据都复制到每个副本。当不可用的节点重新上线后,它如何赶上它错过的写入?

The replication scheme should ensure that eventually all the data is copied to every replica. After an unavailable node comes back online, how does it catch up on the writes that it missed?

Dynamo 风格的数据存储中经常使用两种机制:

Two mechanisms are often used in Dynamo-style datastores:

读取修复
Read repair

当客户端并行读取多个节点时,它可以检测到任何过时的响应。例如,在图 5-10中,用户 2345 从副本 3 获取版本 6 值,从副本 1 和 2 获取版本 7 值。客户端发现副本 3 具有过时值,并将较新的值写回该副本。这种方法对于经常读取的值非常有效。

When a client makes a read from several nodes in parallel, it can detect any stale responses. For example, in Figure 5-10, user 2345 gets a version 6 value from replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale value and writes the newer value back to that replica. This approach works well for values that are frequently read.
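作为一个示意性草图,下面的 Python 代码演示了读修复的基本思路(其中 Replica 类及其 get/put 接口纯属为说明而假设,并非任何真实数据存储的 API;真实实现还需处理并行请求与网络故障):

As an illustrative sketch, the following Python code shows the basic idea of read repair (the Replica class and its get/put interface are assumptions for illustration, not any real datastore's API; a real implementation would also issue the requests in parallel and handle network faults):

```python
# Illustrative sketch of client-driven read repair (hypothetical Replica API).
# Each replica stores a (version, value) pair per key.

class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def get(self, key):
        return self.data[key]

    def put(self, key, version, value):
        self.data[key] = (version, value)

def read_with_repair(replicas, key, r):
    """Read from the replicas, return the newest value, and write it
    back to any replica that responded with a stale version."""
    responses = []
    for replica in replicas:
        try:
            responses.append((replica, replica.get(key)))
        except (KeyError, IOError):
            continue  # replica unavailable or missing the key
    if len(responses) < r:
        raise RuntimeError("fewer than r replicas responded")
    latest_version, latest_value = max(resp for _, resp in responses)
    for replica, (version, _) in responses:
        if version < latest_version:
            replica.put(key, latest_version, latest_value)  # read repair
    return latest_value
```

在图 5-10 的场景中,持有版本 6 旧值的副本 3 会被写回版本 7 的值。

In the scenario of Figure 5-10, replica 3, holding the stale version 6 value, would have the version 7 value written back to it.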

反熵过程
Anti-entropy process

此外,某些数据存储具有后台进程,该进程不断查找副本之间的数据差异,并将任何丢失的数据从一个副本复制到另一个副本。与基于领导者的复制中的复制日志不同,此反熵过程不会以任何特定顺序复制写入,并且在复制数据之前可能会存在明显的延迟。

In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replication, this anti-entropy process does not copy writes in any particular order, and there may be a significant delay before data is copied.
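下面是反熵过程的一个极简草图,副本存储用字典表示(为清晰起见逐键扫描;真实系统通常使用 Merkle 树等结构来高效地发现差异,此处的表示纯属示意性假设):

Here is a minimal sketch of an anti-entropy pass over dict-based replica stores (it scans every key for clarity; real systems typically compare structures such as Merkle trees to find differences efficiently, so this representation is an assumption for illustration):

```python
# Illustrative anti-entropy pass: copy to `destination` any key it is
# missing or holds at an older version. Values are (version, value) pairs;
# note that no particular write order is preserved.

def anti_entropy_pass(source, destination):
    for key, (version, value) in source.items():
        if key not in destination or destination[key][0] < version:
            destination[key] = (version, value)

def sync_all(replica_stores):
    # run the pass in both directions between every pair of replicas
    for src in replica_stores:
        for dst in replica_stores:
            if src is not dst:
                anti_entropy_pass(src, dst)
```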

并非所有系统都实现这两者;例如,伏地魔目前没有反熵过程。请注意,如果没有反熵过程,很少读取的值可能会从某些副本中丢失,从而降低持久性,因为仅当应用程序读取值时才会执行读修复。

Not all systems implement both of these; for example, Voldemort currently does not have an anti-entropy process. Note that without an anti-entropy process, values that are rarely read may be missing from some replicas and thus have reduced durability, because read repair is only performed when a value is read by the application.

读写仲裁

Quorums for reading and writing

在图 5-10 的示例中,我们认为写入是成功的,即使只在三个副本中的两个上进行了处理。如果只有三分之一的副本接受写入怎么办?我们能把这件事推到什么程度?

In the example of Figure 5-10, we considered the write to be successful even though it was only processed on two out of three replicas. What if only one out of three replicas accepted the write? How far can we push this?

如果我们知道每次成功的写入都保证至少存在于三个副本中的两个上,则意味着最多有一个副本可能是过时的。因此,如果我们从至少两个副本中读取数据,我们就可以确定这两个副本中至少有一个是最新的。如果第三个副本出现故障或响应缓慢,读取仍然可以继续返回最新值。

If we know that every successful write is guaranteed to be present on at least two out of three replicas, that means at most one replica can be stale. Thus, if we read from at least two replicas, we can be sure that at least one of the two is up to date. If the third replica is down or slow to respond, reads can nevertheless continue returning an up-to-date value.

更一般地,如果有 n 个副本,则每次写入必须由 w 个节点确认才能被认为成功,并且每次读取必须至少查询 r 个节点。(在我们的示例中,n = 3,w = 2,r = 2。)只要 w + r > n,我们就可以期望在读取时获得最新值,因为我们读取的 r 个节点中至少有一个必须是最新的。遵守这些 r 和 w 值的读取和写入称为仲裁读取和写入 [ 44 ]。您可以将 r 和 w 视为读取或写入有效所需的最小票数。

More generally, if there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. (In our example, n = 3, w = 2, r = 2.) As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes [44]. You can think of r and w as the minimum number of votes required for the read or write to be valid.

在 Dynamo 风格的数据库中,参数 n、w 和 r 通常是可配置的。常见的选择是将 n 设为奇数(通常为 3 或 5)并设置 w = r = (n + 1) / 2(向上舍入)。但是,您可以根据需要更改这些数字。例如,写入次数少、读取次数多的工作负载可能会受益于设置 w = n 和 r = 1。这会使读取速度更快,但缺点是只要有一个节点发生故障,所有数据库写入就会失败。

In Dynamo-style databases, the parameters n, w, and r are typically configurable. A common choice is to make n an odd number (typically 3 or 5) and to set w = r = (n + 1) / 2 (rounded up). However, you can vary the numbers as you see fit. For example, a workload with few writes and many reads may benefit from setting w = n and r = 1. This makes reads faster, but has the disadvantage that just one failed node causes all database writes to fail.
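下面几行 Python 代码体现了这一常见默认值以及仲裁条件(函数名为示意性假设,并非任何真实数据库的配置 API):

The following few lines of Python capture this common default and the quorum condition (the function names are assumptions for illustration, not any real database's configuration API):

```python
import math

def default_quorum(n):
    """Common default: w = r = (n + 1) / 2, rounded up."""
    w = r = math.ceil((n + 1) / 2)
    return w, r

def overlap_guaranteed(n, w, r):
    # the read set and write set must share at least one node iff w + r > n
    return w + r > n
```

例如,n = 3 时得到 w = r = 2;设置 w = n、r = 1 同样满足条件,但任何一个节点故障都会阻塞所有写入。

For example, n = 3 yields w = r = 2; setting w = n, r = 1 also satisfies the condition, but then any single failed node blocks all writes.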

注意

集群中可能有超过n 个节点,但任何给定值仅存储在n 个 节点上。这允许对数据集进行分区,支持大于一个节点所能容纳的数据集。我们将在第 6 章中回到分区。

There may be more than n nodes in the cluster, but any given value is stored only on n nodes. This allows the dataset to be partitioned, supporting datasets that are larger than you can fit on one node. We will return to partitioning in Chapter 6.

仲裁条件w  +  r > n允许系统容忍不可用的节点,如下所示:

The quorum condition, w + r > n, allows the system to tolerate unavailable nodes as follows:

  • 如果w  <  n,如果节点不可用,我们仍然可以处理写入。

  • If w < n, we can still process writes if a node is unavailable.

  • 如果r  <  n,如果节点不可用,我们仍然可以处理读取。

  • If r < n, we can still process reads if a node is unavailable.

  • n  = 3、w  = 2、r  = 2 时,我们可以容忍一个不可用的节点。

  • With n = 3, w = 2, r = 2 we can tolerate one unavailable node.

  • n  = 5、w  = 3、r  = 3 时,我们可以容忍两个不可用的节点。这种情况如图5-11所示。

  • With n = 5, w = 3, r = 3 we can tolerate two unavailable nodes. This case is illustrated in Figure 5-11.

  • 通常,读取和写入始终并行发送到所有 n 个副本。参数 w 和 r 决定了我们等待多少个节点,即在我们认为读取或写入成功之前,n 个节点中有多少个需要报告成功。

  • Normally, reads and writes are always sent to all n replicas in parallel. The parameters w and r determine how many nodes we wait for—i.e., how many of the n nodes need to report success before we consider the read or write to be successful.
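下面的草图模拟了这一点:写入被发送到所有副本,只要 w 个副本确认即视为成功(FlakyReplica 类及其接口为示意性假设;为简单起见这里顺序发送,实际请求是并行发出的):

The sketch below simulates this: the write is sent to every replica, and it succeeds once w of them acknowledge it (the FlakyReplica class and its interface are assumptions for illustration; requests are issued sequentially here for simplicity, whereas in practice they go out in parallel):

```python
# Sketch: send a write to all n replicas and count acknowledgments.

class FlakyReplica:
    def __init__(self, available=True):
        self.available = available
        self.data = {}

    def put(self, key, value):
        if not self.available:
            raise IOError("replica unreachable")
        self.data[key] = value

def quorum_write(replicas, key, value, w):
    """Succeed once w of the n replicas acknowledge the write."""
    acks = 0
    for replica in replicas:
        try:
            replica.put(key, value)
            acks += 1
        except IOError:
            pass  # down, slow, or unreachable: we only count successes
    return acks >= w
```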

图 5-11。如果 w + r > n,则您从中读取的 r 个副本中至少有一个必须看到最近的一次成功写入。

Figure 5-11. If w + r > n, at least one of the r replicas you read from must have seen the most recent successful write.

如果可用的 w 或 r 个节点少于所需数量,写入或读取将返回错误。节点不可用的原因有很多:节点已关闭(崩溃、断电)、执行操作时出错(由于磁盘已满而无法写入)、客户端与节点之间的网络中断,或任何其他原因。我们只关心节点是否返回了成功的响应,而不需要区分不同类型的故障。

If fewer than the required w or r nodes are available, writes or reads return an error. A node could be unavailable for many reasons: because the node is down (crashed, powered down), due to an error executing the operation (can’t write because the disk is full), due to a network interruption between the client and the node, or for any number of other reasons. We only care whether the node returned a successful response and don’t need to distinguish between different kinds of fault.

群体一致性的限制

Limitations of Quorum Consistency

如果您有n 个副本,并且选择wr使得w  +  r > n,则通常可以期望每次读取都返回为键写入的最新值。出现这种情况是因为您写入的节点集和您读取的节点集必须重叠。也就是说,你读取的节点中至少有一个节点的值是最新的( 如图5-11所示)。

If you have n replicas, and you choose w and r such that w + r > n, you can generally expect every read to return the most recent value written for a key. This is the case because the set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That is, among the nodes you read there must be at least one node with the latest value (illustrated in Figure 5-11).

通常,r 和 w 被选择为多数(超过 n/2)节点,因为这样可以确保 w + r > n,同时仍能容忍最多 n/2 个节点的故障。但仲裁不一定是多数,重要的是读写操作所使用的节点集至少有一个节点重叠。其他仲裁分配也是可能的,这使得分布式算法的设计具有一定的灵活性 [ 45 ]。

Often, r and w are chosen to be a majority (more than n/2) of nodes, because that ensures w + r > n while still tolerating up to n/2 node failures. But quorums are not necessarily majorities—it only matters that the sets of nodes used by the read and write operations overlap in at least one node. Other quorum assignments are possible, which allows some flexibility in the design of distributed algorithms [45].

您还可以将 w 和 r 设置为较小的数字,使得 w + r ≤ n(即不满足仲裁条件)。在这种情况下,读取和写入仍将发送到 n 个节点,但只需要较少数量的成功响应即可使操作成功。

You may also set w and r to smaller numbers, so that w + r ≤ n (i.e., the quorum condition is not satisfied). In this case, reads and writes will still be sent to n nodes, but a smaller number of successful responses is required for the operation to succeed.

使用较小的 w 和 r,您更有可能读取到过时的值,因为您的读取更有可能没有包含具有最新值的节点。从好的方面来说,这种配置可以降低延迟并提高可用性:如果出现网络中断并且许多副本无法访问,您仍能继续处理读取和写入的机会更大。只有当可访问副本的数量低于 w 或 r 时,数据库才会分别变得不可写入或不可读取。

With a smaller w and r you are more likely to read stale values, because it’s more likely that your read didn’t include the node with the latest value. On the upside, this configuration allows lower latency and higher availability: if there is a network interruption and many replicas become unreachable, there’s a higher chance that you can continue processing reads and writes. Only after the number of reachable replicas falls below w or r does the database become unavailable for writing or reading, respectively.

然而,即使 w + r > n,也可能出现返回过时值的边缘情况。这些情况取决于具体实现,但可能的场景包括:

However, even with w + r > n, there are likely to be edge cases where stale values are returned. These depend on the implementation, but possible scenarios include:

  • 如果使用草率仲裁(请参阅“草率仲裁和提示切换”),则w写入可能会与r读取位于不同的节点上,因此r 节点和w节点之间不再有保证的重叠[ 46 ] 。

  • If a sloppy quorum is used (see “Sloppy Quorums and Hinted Handoff”), the w writes may end up on different nodes than the r reads, so there is no longer a guaranteed overlap between the r nodes and the w nodes [46].

  • 如果两个写入同时发生,则不清楚哪一个先发生。在这种情况下,唯一安全的解决方案是合并并发写入(请参阅“处理写入冲突”)。如果根据时间戳选择获胜者(最后一次写入获胜),则写入可能会由于时钟偏差而丢失[ 35 ]。我们将在“检测并发写入”中回到这个主题 。

  • If two writes occur concurrently, it is not clear which one happened first. In this case, the only safe solution is to merge the concurrent writes (see “Handling Write Conflicts”). If a winner is picked based on a timestamp (last write wins), writes can be lost due to clock skew [35]. We will return to this topic in “Detecting Concurrent Writes”.

  • 如果写入与读取同时发生,则写入可能仅反映在某些副本上。在这种情况下,无法确定读取返回的是旧值还是新值。

  • If a write happens concurrently with a read, the write may be reflected on only some of the replicas. In this case, it’s undetermined whether the read returns the old or the new value.

  • 如果写入在某些副本上成功,但在其他副本上失败(例如,因为某些节点上的磁盘已满),并且在少于w 个副本上总体成功,则不会在成功的副本上回滚。这意味着如果写入被报告为失败,则后续读取可能会也可能不会返回该写入的值[ 47 ]。

  • If a write succeeded on some replicas but failed on others (for example because the disks on some nodes are full), and overall succeeded on fewer than w replicas, it is not rolled back on the replicas where it succeeded. This means that if a write was reported as failed, subsequent reads may or may not return the value from that write [47].

  • 如果承载新值的节点发生故障,并且从承载旧值的副本恢复其数据,则存储新值的副本数量可能会低于w,从而破坏仲裁条件。

  • If a node carrying a new value fails, and its data is restored from a replica carrying an old value, the number of replicas storing the new value may fall below w, breaking the quorum condition.

  • 即使一切正常,也存在一些边缘情况,在这种情况下,您可能会在时机上不走运,正如我们将在“线性化和法定人数”中看到的那样。

  • Even if everything is working correctly, there are edge cases in which you can get unlucky with the timing, as we shall see in “Linearizability and quorums”.

因此,虽然仲裁看起来可以保证读取返回最新写入的值,但实际上并没有那么简单。Dynamo 风格的数据库通常针对可以容忍最终一致性的用例进行优化。参数 w 和 r 允许您调整读取到陈旧值的概率,但明智的做法是不要将它们视为绝对保证。

Thus, although quorums appear to guarantee that a read returns the latest written value, in practice it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate eventual consistency. The parameters w and r allow you to adjust the probability of stale values being read, but it’s wise to not take them as absolute guarantees.

特别是,您通常无法获得“复制滞后问题”中讨论的保证(读己之写、单调读或一致前缀读),因此应用程序中可能会出现前面提到的异常情况。更强的保证通常需要事务或共识。我们将在第 7 章和第 9 章中再次讨论这些主题。

In particular, you usually do not get the guarantees discussed in “Problems with Replication Lag” (reading your writes, monotonic reads, or consistent prefix reads), so the previously mentioned anomalies can occur in applications. Stronger guarantees generally require transactions or consensus. We will return to these topics in Chapter 7 and Chapter 9.

监控陈旧性

Monitoring staleness

从操作角度来看,监控数据库是否返回最新结果非常重要。即使您的应用程序可以容忍过时的读取,您也需要了解复制的运行状况。如果它明显落后,它应该提醒您,以便您调查原因(例如,网络问题或节点过载)。

From an operational perspective, it’s important to monitor whether your databases are returning up-to-date results. Even if your application can tolerate stale reads, you need to be aware of the health of your replication. If it falls behind significantly, it should alert you so that you can investigate the cause (for example, a problem in the network or an overloaded node).

对于基于领导者的复制,数据库通常会公开复制延迟的指标,您可以将其输入监控系统。这是可能的,因为写入以相同的顺序应用于领导者和追随者,并且每个节点在复制日志中都有一个位置(它在本地应用的写入次数)。通过从领导者的当前位置减去追随者的当前位置,您可以测量复制滞后量。

For leader-based replication, the database typically exposes metrics for the replication lag, which you can feed into a monitoring system. This is possible because writes are applied to the leader and to followers in the same order, and each node has a position in the replication log (the number of writes it has applied locally). By subtracting a follower’s current position from the leader’s current position, you can measure the amount of replication lag.

然而,在无领导者复制的系统中,写入的应用没有固定的顺序,这使得监视变得更加困难。此外,如果数据库仅使用读修复(无反熵),则值的年龄没有限制 - 如果不经常读取某个值,则陈旧副本返回的值可能是古老的。

However, in systems with leaderless replication, there is no fixed order in which writes are applied, which makes monitoring more difficult. Moreover, if the database only uses read repair (no anti-entropy), there is no limit to how old a value might be—if a value is only infrequently read, the value returned by a stale replica may be ancient.

已经有一些研究测量了无领导者复制数据库中的副本陈旧性,并根据参数 n、w 和 r 预测陈旧读取的预期百分比 [ 48 ]。不幸的是,这还不是常见的做法,但最好将陈旧性测量纳入数据库的标准指标集中。最终一致性是一个故意模糊的保证,但对于可操作性来说,能够量化“最终”非常重要。

There has been some research on measuring replica staleness in databases with leaderless replication and predicting the expected percentage of stale reads depending on the parameters n, w, and r [48]. This is unfortunately not yet common practice, but it would be good to include staleness measurements in the standard set of metrics for databases. Eventual consistency is a deliberately vague guarantee, but for operability it’s important to be able to quantify “eventual.”

草率仲裁和暗示切换

Sloppy Quorums and Hinted Handoff

具有适当配置的仲裁的数据库可以容忍单个节点的故障,而不需要故障转移。它们还可以容忍单个节点变慢,因为请求不必等待所有n 个节点响应,它们可以在wr节点响应时返回。这些特性使得具有无领导者复制的数据库对于需要高可用性和低延迟并且可以容忍偶尔的陈旧读取的用例很有吸引力。

Databases with appropriately configured quorums can tolerate the failure of individual nodes without the need for failover. They can also tolerate individual nodes going slow, because requests don’t have to wait for all n nodes to respond—they can return when w or r nodes have responded. These characteristics make databases with leaderless replication appealing for use cases that require high availability and low latency, and that can tolerate occasional stale reads.

然而,仲裁(如目前所描述的)并不具有其本可具有的容错能力。网络中断可以轻易地将客户端与大量数据库节点切断。尽管这些节点还活着,并且其他客户端也许能够连接到它们,但对于被切断的客户端来说,它们与宕机无异。在这种情况下,剩余的可达节点可能少于 w 或 r 个,因此客户端无法再达到仲裁。

However, quorums (as described so far) are not as fault-tolerant as they could be. A network interruption can easily cut off a client from a large number of database nodes. Although those nodes are alive, and other clients may be able to connect to them, to a client that is cut off from the database nodes, they might as well be dead. In this situation, it’s likely that fewer than w or r reachable nodes remain, so the client can no longer reach a quorum.

在大型集群(节点数明显多于n个)中,客户端可能在网络中断期间连接到某些数据库节点,但无法连接到为特定值组装仲裁所需的节点。在这种情况下,数据库设计者面临一个权衡:

In a large cluster (with significantly more than n nodes) it’s likely that the client can connect to some database nodes during the network interruption, just not to the nodes that it needs to assemble a quorum for a particular value. In that case, database designers face a trade-off:

  • 对于无法达到 w 或 r 个节点仲裁的所有请求,返回错误是否更好?

  • Is it better to return errors to all requests for which we cannot reach a quorum of w or r nodes?

  • 或者我们是否应该接受写入,并将它们写入一些可访问但不在该值通常所在的n 个节点中的节点?

  • Or should we accept writes anyway, and write them to some nodes that are reachable but aren’t among the n nodes on which the value usually lives?

后者被称为草率仲裁 [ 37 ]:写入和读取仍然需要 w 和 r 个成功响应,但这些响应可以来自不在某个值指定的 n 个“主”节点之列的节点。打个比方,如果你把自己锁在门外,你可以敲邻居的门,询问是否可以暂时睡在他们的沙发上。

The latter is known as a sloppy quorum [37]: writes and reads still require w and r successful responses, but those may include nodes that are not among the designated n “home” nodes for a value. By analogy, if you lock yourself out of your house, you may knock on the neighbor’s door and ask whether you may stay on their couch temporarily.

一旦网络中断修复,一个节点代表另一节点临时接受的任何写入都会被发送到适当的“主”节点。这称为暗示切换。(一旦你再次找到房子的钥匙,你的邻居就会礼貌地请你离开他们的沙发回家。)

Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes. This is called hinted handoff. (Once you find the keys to your house again, your neighbor politely asks you to get off their couch and go home.)
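下面的草图示意了暗示切换的流程(Node 类及方法名均为示意性假设,并非任何真实系统的 API):

The sketch below illustrates the hinted handoff flow (the Node class and its method names are assumptions for illustration, not any real system's API):

```python
# Sketch of hinted handoff: a reachable node accepts a write on behalf of an
# unreachable "home" node, remembers a hint, and forwards the write later.

class Node:
    def __init__(self):
        self.data = {}
        self.hints = []  # (home_node, key, value) accepted on another node's behalf

    def write_local(self, key, value):
        self.data[key] = value

    def accept_hinted(self, home, key, value):
        self.write_local(key, value)  # counts toward the sloppy quorum's w
        self.hints.append((home, key, value))

    def handoff(self):
        """Once the network interruption is fixed, send hinted writes home."""
        while self.hints:
            home, key, value = self.hints.pop()
            home.write_local(key, value)
```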

草率仲裁对于提高写入可用性特别有用:只要任何 w 个节点可用,数据库就可以接受写入。然而,这意味着即使当 w + r > n 时,你也不能确定读取到某个键的最新值,因为最新值可能已被临时写入 n 个节点之外的某些节点 [ 47 ]。

Sloppy quorums are particularly useful for increasing write availability: as long as any w nodes are available, the database can accept writes. However, this means that even when w + r > n, you cannot be sure to read the latest value for a key, because the latest value may have been temporarily written to some nodes outside of n [47].

因此,草率的法定人数实际上根本不是传统意义上的法定人数。它只是持久性的保证,即数据存储在某个地方的w节点上。在暗示的切换完成之前,不能保证r节点的读取会看到它。

Thus, a sloppy quorum actually isn’t a quorum at all in the traditional sense. It’s only an assurance of durability, namely that the data is stored on w nodes somewhere. There is no guarantee that a read of r nodes will see it until the hinted handoff has completed.

在所有常见的 Dynamo 实现中,草率的仲裁都是可选的。在 Riak 中,它们默认启用,而在 Cassandra 和 Voldemort 中,它们默认禁用 [ 46 , 49 , 50 ]。

Sloppy quorums are optional in all common Dynamo implementations. In Riak they are enabled by default, and in Cassandra and Voldemort they are disabled by default [46, 49, 50].

多数据中心运行

Multi-datacenter operation

我们之前讨论了跨数据中心复制作为多领导者复制的用例(请参阅 “多领导者复制”)。无领导者复制也适用于多数据中心操作,因为它旨在容忍冲突的并发写入、网络中断和延迟峰值。

We previously discussed cross-datacenter replication as a use case for multi-leader replication (see “Multi-Leader Replication”). Leaderless replication is also suitable for multi-datacenter operation, since it is designed to tolerate conflicting concurrent writes, network interruptions, and latency spikes.

Cassandra 和 Voldemort 在正常的无领导者模型中实现了多数据中心支持:副本数量n包括所有数据中心中的节点,并且在配置中您可以指定每个数据中心中希望拥有n个副本中的多少个。来自客户端的每次写入都会发送到所有副本,而不管数据中心如何,但客户端通常只等待其本地数据中心内的法定节点的确认,因此不会受到跨数据中心链路上的延迟和中断的影响。尽管配置具有一定的灵活性,但对其他数据中心的高延迟写入通常配置为异步发生[ 50 , 51 ]。

Cassandra and Voldemort implement their multi-datacenter support within the normal leaderless model: the number of replicas n includes nodes in all datacenters, and in the configuration you can specify how many of the n replicas you want to have in each datacenter. Each write from a client is sent to all replicas, regardless of datacenter, but the client usually only waits for acknowledgment from a quorum of nodes within its local datacenter so that it is unaffected by delays and interruptions on the cross-datacenter link. The higher-latency writes to other datacenters are often configured to happen asynchronously, although there is some flexibility in the configuration [50, 51].

Riak 将客户端和数据库节点之间的所有通信保留在一个数据中心本地,因此n 描述了一个数据中心内的副本数量。数据库集群之间的跨数据中心复制在后台异步发生,其风格类似于多领导者复制[ 52 ]。

Riak keeps all communication between clients and database nodes local to one datacenter, so n describes the number of replicas within one datacenter. Cross-datacenter replication between database clusters happens asynchronously in the background, in a style that is similar to multi-leader replication [52].

检测并发写入

Detecting Concurrent Writes

Dynamo 风格的数据库允许多个客户端同时写入同一个密钥,这意味着即使使用严格的仲裁也会发生冲突。这种情况类似于多主复制(请参阅“处理写入冲突”),尽管在 Dynamo 风格的数据库中,在读取修复或暗示切换期间也可能会出现冲突。

Dynamo-style databases allow several clients to concurrently write to the same key, which means that conflicts will occur even if strict quorums are used. The situation is similar to multi-leader replication (see “Handling Write Conflicts”), although in Dynamo-style databases conflicts can also arise during read repair or hinted handoff.

问题在于,由于可变的网络延迟和部分故障,事件可能以不同的顺序到达不同的节点。例如,图 5-12显示了两个客户端 A 和 B 同时写入三节点数据存储中的键X :

The problem is that events may arrive in a different order at different nodes, due to variable network delays and partial failures. For example, Figure 5-12 shows two clients, A and B, simultaneously writing to a key X in a three-node datastore:

  • 节点 1 接收到来自 A 的写入,但由于短暂中断而从未接收到来自 B 的写入。

  • Node 1 receives the write from A, but never receives the write from B due to a transient outage.

  • 节点 2 首先接收来自 A 的写入,然后接收来自 B 的写入。

  • Node 2 first receives the write from A, then the write from B.

  • 节点 3 首先接收来自 B 的写入,然后接收来自 A 的写入。

  • Node 3 first receives the write from B, then the write from A.

图 5-12。Dynamo 风格的数据存储中的并发写入:没有明确定义的顺序。

Figure 5-12. Concurrent writes in a Dynamo-style datastore: there is no well-defined ordering.

如果每个节点在收到客户端的写请求时都简单地覆盖某个键的值,则节点将变得永久不一致,如图5-12中的最终get请求 所示:节点 2 认为X的最终值为B,而其他节点认为该值为A。

If each node simply overwrote the value for a key whenever it received a write request from a client, the nodes would become permanently inconsistent, as shown by the final get request in Figure 5-12: node 2 thinks that the final value of X is B, whereas the other nodes think that the value is A.

为了最终保持一致,副本应该收敛到相同的值。他们是怎么做到的?人们可能希望复制数据库能够自动处理这个问题,但不幸的是,大多数实现都非常糟糕:如果您想避免丢失数据,您(应用程序开发人员)需要了解很多有关数据库冲突处理的内部结构。

In order to become eventually consistent, the replicas should converge toward the same value. How do they do that? One might hope that replicated databases would handle this automatically, but unfortunately most implementations are quite poor: if you want to avoid losing data, you—the application developer—need to know a lot about the internals of your database’s conflict handling.

我们在“处理写入冲突”中简要介绍了一些解决冲突的技术 。在结束本章之前,让我们更详细地探讨这个问题。

We briefly touched on some techniques for conflict resolution in “Handling Write Conflicts”. Before we wrap up this chapter, let’s explore the issue in a bit more detail.

最后写入获胜(丢弃并发写入)

Last write wins (discarding concurrent writes)

实现最终收敛的一种方法是声明每个副本只需要存储最新的值并允许覆盖和丢弃“较旧”的值。然后,只要我们有某种方法明确确定哪个写入更“最近”,并且每次写入最终都会复制到每个副本,副本最终将收敛到相同的值。

One approach for achieving eventual convergence is to declare that each replica need only store the most “recent” value and allow “older” values to be overwritten and discarded. Then, as long as we have some way of unambiguously determining which write is more “recent,” and every write is eventually copied to every replica, the replicas will eventually converge to the same value.

正如“最近”周围的引文所示,这个想法实际上是相当具有误导性的。在图 5-12的示例中,当客户端向数据库节点发送写请求时,两个客户端都不知道对方,因此不清楚哪个客户端先发生。事实上,说这两种情况“首先”发生并没有什么意义:我们说写入是并发的,因此它们的顺序是未定义的。

As indicated by the quotes around “recent,” this idea is actually quite misleading. In the example of Figure 5-12, neither client knew about the other one when it sent its write requests to the database nodes, so it’s not clear which one happened first. In fact, it doesn’t really make sense to say that either happened “first”: we say the writes are concurrent, so their order is undefined.

即使写入没有自然顺序,我们也可以对它们强制执行任意顺序。例如,我们可以为每次写入附加一个时间戳,选择最大的时间戳作为最新的,并丢弃具有较早时间戳的任何写入。这种冲突解决算法称为最后写入获胜(LWW),是 Cassandra [ 53 ]中唯一支持的冲突解决方法,也是 Riak [ 35 ]中的可选功能。

Even though the writes don’t have a natural ordering, we can force an arbitrary order on them. For example, we can attach a timestamp to each write, pick the biggest timestamp as the most “recent,” and discard any writes with an earlier timestamp. This conflict resolution algorithm, called last write wins (LWW), is the only supported conflict resolution method in Cassandra [53], and an optional feature in Riak [35].
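作为示意,LWW 可以草绘成如下几行 Python(时间戳的来源与平局处理都被简化,这正是 LWW 会悄悄丢弃并发写入的原因):

As an illustration, LWW can be sketched in a few lines of Python (where the timestamps come from, and how ties are broken, are both simplified away here, which is precisely why LWW silently discards concurrent writes):

```python
# Last write wins, sketched: each write carries a timestamp, and only the
# value with the greatest timestamp survives; all others are discarded.

def lww_resolve(writes):
    """writes: iterable of (timestamp, value) pairs.
    Returns the surviving value; ties are broken arbitrarily."""
    return max(writes, key=lambda tv: tv[0])[1]
```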

LWW 实现了最终收敛的目标,但以持久性为代价:如果对同一个 key 有多个并发写入,即使它们都向客户端报告为成功(因为它们写入了 w 个副本),也只有其中之一写入的内容将保留下来,而其他内容将被默默地丢弃。此外,LWW 甚至可能会丢弃非并发的写入,正如我们将在“排序事件的时间戳”中讨论的那样。

LWW achieves the goal of eventual convergence, but at the cost of durability: if there are several concurrent writes to the same key, even if they were all reported as successful to the client (because they were written to w replicas), only one of the writes will survive and the others will be silently discarded. Moreover, LWW may even drop writes that are not concurrent, as we shall discuss in “Timestamps for ordering events”.

在某些情况下,例如缓存,丢失的写入可能是可以接受的。如果丢失数据是不可接受的,那么 LWW 并不是解决冲突的最佳选择。

There are some situations, such as caching, in which lost writes are perhaps acceptable. If losing data is not acceptable, LWW is a poor choice for conflict resolution.

将数据库与 LWW 结合使用的唯一安全方法是确保密钥仅写入一次,然后将其视为不可变,从而避免对同一密钥进行任何并发更新。例如,使用 Cassandra 的推荐方法是使用 UUID 作为密钥,从而为每个写入操作提供唯一的密钥 [ 53 ]。

The only safe way of using a database with LWW is to ensure that a key is only written once and thereafter treated as immutable, thus avoiding any concurrent updates to the same key. For example, a recommended way of using Cassandra is to use a UUID as the key, thus giving each write operation a unique key [53].

“先发生”关系和并发性

The “happens-before” relationship and concurrency

我们如何判断两个操作是否并发?为了培养直觉,让我们看一些例子:

How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look at some examples:

  • 在图 5-9 中,两个写入不是并发的:A 的插入发生在 B 的增量之前,因为 B 所增量的值正是 A 插入的值。换句话说,B 的操作建立在 A 的操作之上,所以 B 的操作一定发生在后。我们还说 B 因果依赖于 A。

  • In Figure 5-9, the two writes are not concurrent: A’s insert happens before B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s operation builds upon A’s operation, so B’s operation must have happened later. We also say that B is causally dependent on A.

  • 另一方面,图5-12中的两个写入是并发的:当每个客户端开始操作时,它不知道另一个客户端也在对同一键执行操作。因此,操作之间不存在因果关系。

  • On the other hand, the two writes in Figure 5-12 are concurrent: when each client starts the operation, it does not know that another client is also performing an operation on the same key. Thus, there is no causal dependency between the operations.

如果 B 了解 A、或者依赖于 A、或者以某种方式构建在 A 之上,则操作 A 在另一个操作 B之前发生。一个操作是否发生在另一操作之前是定义并发含义的关键。事实上,我们可以简单地说,如果两个操作都不发生在另一个操作之前(即,两个操作都不知道另一个操作),则两个操作是并发的[ 54 ]。

An operation A happens before another operation B if B knows about A, or depends on A, or builds upon A in some way. Whether one operation happens before another operation is the key to defining what concurrency means. In fact, we can simply say that two operations are concurrent if neither happens before the other (i.e., neither knows about the other) [54].
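这一判定可以用版本向量(本章稍后讨论)机械地完成;下面的草图假设版本向量表示为“节点 ID → 计数器”的映射,这一表示方式是为说明而假设的:

This determination can be made mechanically with version vectors (discussed later in this chapter); the sketch below assumes a version vector is represented as a map from node ID to counter, a representation chosen here for illustration:

```python
# Sketch: deciding happens-before vs. concurrent by comparing version vectors.

def happens_before(a, b):
    """True if the operation with vector `a` happened before the one with `b`:
    every counter in `a` is <= the counter in `b`, and at least one is less."""
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

def concurrent(a, b):
    # neither happened before the other: a conflict that must be resolved
    return not happens_before(a, b) and not happens_before(b, a)
```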

因此,每当有两个操作 A 和 B 时,就会存在三种可能性:A 发生在 B 之前,或者 B 发生在 A 之前,或者 A 和 B 是并发的。我们需要的是一种算法来告诉我们两个操作是否并发。如果一个操作发生在另一个操作之前,则后面的操作应该覆盖前面的操作,但如果这些操作是并发的,则需要解决冲突。

Thus, whenever you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. What we need is an algorithm to tell us whether two operations are concurrent or not. If one operation happened before another, the later operation should overwrite the earlier operation, but if the operations are concurrent, we have a conflict that needs to be resolved.

捕获“先发生”关系

Capturing the happens-before relationship

让我们看一下一种算法,该算法确定两个操作是否并发,或者一个操作是否先于另一个操作发生。为了简单起见,我们从一个只有一个副本的数据库开始。一旦我们弄清楚如何在单个副本上执行此操作,我们就可以将该方法推广到具有多个副本的无领导数据库。

Let’s look at an algorithm that determines whether two operations are concurrent, or whether one happened before another. To keep things simple, let’s start with a database that has only one replica. Once we have worked out how to do this on a single replica, we can generalize the approach to a leaderless database with multiple replicas.

图 5-13 显示了两个客户端同时向同一购物车添加商品的情况。(如果这个例子让你觉得太无聊,可以想象两个空中交通管制员同时将飞机添加到他们正在跟踪的扇区。)最初,购物车是空的。两个客户端总共对数据库进行了五次写入:

Figure 5-13 shows two clients concurrently adding items to the same shopping cart. (If that example strikes you as too inane, imagine instead two air traffic controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is empty. Between them, the clients make five writes to the database:

  1. 客户 1 添加milk到购物车。这是对该密钥的第一次写入,因此服务器成功存储它并为其分配版本 1。服务器还将该值以及版本号回显给客户端。

  2. Client 1 adds milk to the cart. This is the first write to that key, so the server successfully stores it and assigns it version 1. The server also echoes the value back to the client, along with the version number.

  3. 客户端 2 添加eggs到购物车,但不知道客户端 1 同时添加了milk(客户端 2 认为它eggs是购物车中的唯一商品)。服务器将版本 2 分配给此写入,并将 和 存储eggsmilk两个单独的值。然后,它将两个值以及版本号 2 返回给客户端。

  4. Client 2 adds eggs to the cart, not knowing that client 1 concurrently added milk (client 2 thought that its eggs were the only item in the cart). The server assigns version 2 to this write, and stores eggs and milk as two separate values. It then returns both values to the client, along with the version number of 2.

  5. 客户端 1 不知道客户端 2 的写入,想要添加flour到购物车,因此它认为当前购物车内容应该是[milk, flour]。它将这个值连同服务器之前为客户端 1 提供的版本号 1 一起发送到服务器。服务器可以从版本号看出 的写入[milk, flour]取代了先前的值[milk],但它与 并发[eggs]。因此,服务器将版本 3 分配给[milk, flour],覆盖版本 1 值[milk],但保留版本 2 值[eggs]并将两个剩余值返回给客户端。

  6. Client 1, oblivious to client 2’s write, wants to add flour to the cart, so it thinks the current cart contents should be [milk, flour]. It sends this value to the server, along with the version number 1 that the server gave client 1 previously. The server can tell from the version number that the write of [milk, flour] supersedes the prior value of [milk] but that it is concurrent with [eggs]. Thus, the server assigns version 3 to [milk, flour], overwrites the version 1 value [milk], but keeps the version 2 value [eggs] and returns both remaining values to the client.

  7. 同时,客户 2 想要添加ham到购物车,但不知道客户 1 刚刚添加了flour[milk]客户端 2 在上次响应中从[eggs]服务器接收了这两个值,因此客户端现在合并这些值并相加以ham形成新值[eggs, milk, ham]。它将该值与之前的版本号 2 一起发送到服务器。服务器检测到版本 2 覆盖[eggs]但与 并发[milk, flour],因此剩下的两个值属于[milk, flour]版本 3 和[eggs, milk, ham]版本 4。

  8. Meanwhile, client 2 wants to add ham to the cart, unaware that client 1 just added flour. Client 2 received the two values [milk] and [eggs] from the server in the last response, so the client now merges those values and adds ham to form a new value, [eggs, milk, ham]. It sends that value to the server, along with the previous version number 2. The server detects that version 2 overwrites [eggs] but is concurrent with [milk, flour], so the two remaining values are [milk, flour] with version 3, and [eggs, milk, ham] with version 4.

  9. 最后,客户端 1 想要添加 bacon。它之前从服务器收到了版本 3 的 [milk, flour] 和 [eggs],因此它合并这些值,加上 bacon,并将最终值 [milk, flour, eggs, bacon] 连同版本号 3 一起发送到服务器。这会覆盖 [milk, flour](请注意,[eggs] 在上一步中已经被覆盖),但与 [eggs, milk, ham] 并发,因此服务器保留这两个并发值。

  10. Finally, client 1 wants to add bacon. It previously received [milk, flour] and [eggs] from the server at version 3, so it merges those, adds bacon, and sends the final value [milk, flour, eggs, bacon] to the server, along with the version number 3. This overwrites [milk, flour] (note that [eggs] was already overwritten in the last step) but is concurrent with [eggs, milk, ham], so the server keeps those two concurrent values.

图 5-13。捕获同时编辑购物车的两个客户端之间的因果依赖关系。

Figure 5-13. Capturing causal dependencies between two clients concurrently editing a shopping cart.

图 5-14 以图形方式说明了图 5-13 中各操作之间的数据流。箭头指示哪个操作发生在哪个操作之前,即后一个操作知道或依赖于前一个操作。在此示例中,客户端永远无法与服务器上的数据完全保持同步,因为总是有另一个操作在同时进行。但旧版本的值最终会被覆盖,并且不会丢失任何写入。

The dataflow between the operations in Figure 5-13 is illustrated graphically in Figure 5-14. The arrows indicate which operation happened before which other operation, in the sense that the later operation knew about or depended on the earlier one. In this example, the clients are never fully up to date with the data on the server, since there is always another operation going on concurrently. But old versions of the value do get overwritten eventually, and no writes are lost.

图 5-14。图 5-13 中的因果依赖关系图。

Figure 5-14. Graph of causal dependencies in Figure 5-13.

请注意,服务器可以通过查看版本号来确定两个操作是否并发 - 它不需要解释值本身(因此该值可以是任何数据结构)。该算法的工作原理如下:

Note that the server can determine whether two operations are concurrent by looking at the version numbers—it does not need to interpret the value itself (so the value could be any data structure). The algorithm works as follows:

  • 服务器为每个键维护一个版本号,每次写入该键时都递增版本号,并将新版本号与写入的值一起存储。

  • The server maintains a version number for every key, increments the version number every time that key is written, and stores the new version number along with the value written.

  • 当客户端读取某个键时,服务器返回所有未被覆盖的值,以及最新的版本号。客户端在写入之前必须先读取该键。

  • When a client reads a key, the server returns all values that have not been overwritten, as well as the latest version number. A client must read a key before writing.

  • 当客户端写入某个键时,它必须包含先前读取到的版本号,并且必须将先前读取中收到的所有值合并在一起。(写入请求的响应可以像读取一样返回所有当前值,这使我们能够像购物车示例中那样串联多次写入。)

  • When a client writes a key, it must include the version number from the prior read, and it must merge together all values that it received in the prior read. (The response from a write request can be like a read, returning all current values, which allows us to chain several writes like in the shopping cart example.)

  • 当服务器收到带有特定版本号的写入时,它可以覆盖具有该版本号或更低版本号的所有值(因为它知道这些值已被合并到新值中),但必须保留所有具有更高版本号的值(因为这些值与传入的写入是并发的)。

  • When the server receives a write with a particular version number, it can overwrite all values with that version number or below (since it knows that they have been merged into the new value), but it must keep all values with a higher version number (because those values are concurrent with the incoming write).

当写入包含先前读取的版本号时,这会告诉我们写入基于哪个先前状态。如果您进行写入时不包含版本号,则它将与所有其他写入同时进行,因此它不会覆盖任何内容 - 它只会作为后续读取时的值之一返回。

When a write includes the version number from a prior read, that tells us which previous state the write is based on. If you make a write without including a version number, it is concurrent with all other writes, so it will not overwrite anything—it will just be returned as one of the values on subsequent reads.
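上述算法可以草拟如下(这是一个假设性的示例,并非任何数据库的实际实现;类名和方法名均为虚构):

The algorithm above can be sketched as follows (a hypothetical illustration, not the actual implementation of any database; the class and method names are made up):

```python
class VersionedStore:
    """Single-replica store that tracks causality with one version
    number per key, as described in the algorithm above."""

    def __init__(self):
        self.latest = {}   # key -> latest version number issued
        self.values = {}   # key -> {version: value} of non-overwritten values

    def read(self, key):
        # Return all values that have not been overwritten,
        # together with the latest version number.
        return self.latest.get(key, 0), list(self.values.get(key, {}).values())

    def write(self, key, value, based_on_version):
        # based_on_version is the version number from the client's
        # prior read (0 if the client has never read this key).
        version = self.latest.get(key, 0) + 1
        self.latest[key] = version
        siblings = self.values.setdefault(key, {})
        # Overwrite everything the write is based on; keep anything
        # newer, since it is concurrent with this write.
        for v in list(siblings):
            if v <= based_on_version:
                del siblings[v]
        siblings[version] = value
        return version, list(siblings.values())
```

按照图 5-13 重放这五次写入,最终剩下的两个兄弟值正是 [milk, flour, eggs, bacon] 和 [eggs, milk, ham]。

Replaying the five writes of Figure 5-13 against this sketch leaves exactly the two siblings [milk, flour, eggs, bacon] and [eggs, milk, ham].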

合并同时写入的值

Merging concurrently written values

该算法确保不会默默地删除任何数据,但不幸的是,它要求客户端做一些额外的工作:如果多个操作同时发生,则客户端必须通过合并同时写入的值来进行清理。Riak 将这些并发值称为 “兄弟”

This algorithm ensures that no data is silently dropped, but it unfortunately requires that the clients do some extra work: if several operations happen concurrently, clients have to clean up afterward by merging the concurrently written values. Riak calls these concurrent values siblings.

合并同级值本质上与多领导者复制中的冲突解决问题相同,我们之前讨论过(请参阅“处理写入冲突”)。一种简单的方法是根据版本号或时间戳选择其中一个值(最后一次写入获胜),但这意味着会丢失数据。因此,您可能需要在应用程序代码中做一些更智能的事情。

Merging sibling values is essentially the same problem as conflict resolution in multi-leader replication, which we discussed previously (see “Handling Write Conflicts”). A simple approach is to just pick one of the values based on a version number or timestamp (last write wins), but that implies losing data. So, you may need to do something more intelligent in application code.

以购物车为例,合并兄弟值的一种合理方法是取并集。在图 5-14 中,最后两个兄弟值是 [milk, flour, eggs, bacon] 和 [eggs, milk, ham];请注意,milk 和 eggs 都出现在两者中,即使它们各自只被写入了一次。合并后的值可能类似于 [milk, flour, eggs, bacon, ham],没有重复项。

With the example of a shopping cart, a reasonable approach to merging siblings is to just take the union. In Figure 5-14, the two final siblings are [milk, flour, eggs, bacon] and [eggs, milk, ham]; note that milk and eggs appear in both, even though they were each only written once. The merged value might be something like [milk, flour, eggs, bacon, ham], without duplicates.
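这种并集合并可以用一个简单的函数来草拟(假设性草图,按首次出现的顺序保留每个商品):

The union merge can be sketched with a simple function (a hypothetical sketch, keeping each item once in first-seen order):

```python
def merge_siblings(siblings):
    """Union merge for shopping-cart siblings: keep each item exactly
    once, preserving the order in which items are first seen."""
    merged = []
    for cart in siblings:
        for item in cart:
            if item not in merged:
                merged.append(item)
    return merged
```

对图 5-14 中的两个最终兄弟值调用该函数,会得到不含重复项的 [milk, flour, eggs, bacon, ham]。

Calling it on the two final siblings from Figure 5-14 yields [milk, flour, eggs, bacon, ham] without duplicates.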

但是,如果您希望允许人们不仅向购物车添加物品,还能从中删除物品,那么取兄弟值的并集可能不会产生正确的结果:如果您合并两个兄弟购物车,而某件物品只在其中一个购物车中被删除,那么被删除的物品将重新出现在并集中 [37]。为了防止这个问题,物品被删除时不能简单地从数据库中移除;相反,系统必须留下一个带有适当版本号的标记,以便在合并兄弟值时表明该物品已被删除。这种删除标记称为墓碑(tombstone)。(我们之前在“哈希索引”的日志压缩上下文中见过墓碑。)

However, if you want to allow people to also remove things from their carts, and not just add things, then taking the union of siblings may not yield the right result: if you merge two sibling carts and an item has been removed in only one of them, then the removed item will reappear in the union of the siblings [37]. To prevent this problem, an item cannot simply be deleted from the database when it is removed; instead, the system must leave a marker with an appropriate version number to indicate that the item has been removed when merging siblings. Such a deletion marker is known as a tombstone. (We previously saw tombstones in the context of log compaction in “Hash Indexes”.)
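带墓碑的合并可以这样草拟(假设性的数据表示,并非 Riak 的实际格式):每个条目记录一个版本号和一个存在标志,删除只是写入一个 present=False 的标记,而不是移除该条目。

Merging with tombstones can be sketched like this (a hypothetical representation, not Riak's actual format): each entry carries a version number and a presence flag, and a removal writes a marker with present=False instead of deleting the entry.

```python
def merge_with_tombstones(a, b):
    """Merge two sibling carts, where each cart maps
    item -> (version, present). A removal is recorded as a tombstone
    (version, False), so it survives the merge."""
    merged = {}
    for item in set(a) | set(b):
        # The entry with the higher version wins; a newer tombstone
        # correctly keeps a removed item removed.
        merged[item] = max(a.get(item, (0, False)), b.get(item, (0, False)))
    return merged
```

如果购物车 A 在版本 3 删除了 milk(留下墓碑),而购物车 B 仍持有版本 2 的 milk,合并结果会保留墓碑,milk 不会重新出现;若没有墓碑,取并集就会让它“复活”。

If cart A removed milk at version 3 (leaving a tombstone) while cart B still holds milk at version 2, the merge keeps the tombstone and milk stays removed; without the tombstone, a plain union would resurrect it.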

由于在应用程序代码中合并兄弟值既复杂又容易出错,因此人们致力于设计能够自动执行这种合并的数据结构,如“自动冲突解决”中所述。例如,Riak 的数据类型支持使用一系列称为 CRDT 的数据结构 [38, 39, 55],它们能够以合理的方式自动合并兄弟值,包括保留删除操作。

As merging siblings in application code is complex and error-prone, there are some efforts to design data structures that can perform this merging automatically, as discussed in “Automatic Conflict Resolution”. For example, Riak’s datatype support uses a family of data structures called CRDTs [38, 39, 55] that can automatically merge siblings in sensible ways, including preserving deletions.

版本向量

Version vectors

图 5-13 中的示例仅使用单个副本。当有多个副本但没有领导者时,算法会如何变化?

The example in Figure 5-13 used only a single replica. How does the algorithm change when there are multiple replicas, but no leader?

图 5-13使用单个版本号来捕获操作之间的依赖关系,但当有多个副本同时接受写入时,这还不够。相反,我们需要为每个副本和每个密钥使用版本号。每个副本在处理写入时都会增加自己的版本号,并且还会跟踪它从其他每个副本看到的版本号。此信息指示要覆盖哪些值以及将哪些值保留为同级值。

Figure 5-13 uses a single version number to capture dependencies between operations, but that is not sufficient when there are multiple replicas accepting writes concurrently. Instead, we need to use a version number per replica as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas. This information indicates which values to overwrite and which values to keep as siblings.

所有副本的版本号集合称为版本向量 [ 56 ]。这个想法的一些变体正在使用中,但最有趣的可能是点版本向量 [ 57 ],它在 Riak 2.0 [ 58 , 59 ]中使用。我们不会详细介绍,但它的工作方式与我们在购物车示例中看到的非常相似。

The collection of version numbers from all the replicas is called a version vector [56]. A few variants of this idea are in use, but the most interesting is probably the dotted version vector [57], which is used in Riak 2.0 [58, 59]. We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.

与图 5-13 中的版本号一样,版本向量在读取值时从数据库副本发送到客户端,并在随后写入值时需要发送回数据库。(Riak 将版本向量编码为字符串,称为因果上下文。)版本向量允许数据库区分覆盖和并发写入。

Like the version numbers in Figure 5-13, version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written. (Riak encodes the version vector as a string that it calls causal context.) The version vector allows the database to distinguish between overwrites and concurrent writes.

此外,与单副本示例一样,应用程序可能需要合并同级。版本向量结构确保从一个副本读取并随后写回另一副本是安全的。这样做可能会导致创建同级,但只要正确合并同级,就不会丢失数据。

Also, like in the single-replica example, the application may need to merge siblings. The version vector structure ensures that it is safe to read from one replica and subsequently write back to another replica. Doing so may result in siblings being created, but no data is lost as long as siblings are merged correctly.

版本向量和向量时钟

Version vectors and vector clocks

版本向量有时也称为向量时钟,尽管它们并不完全相同。差异很微妙——请参阅参考文献以了解详细信息 [ 57 , 60 , 61 ]。简而言之,在比较副本的状态时,版本向量是正确使用的数据结构。

A version vector is sometimes also called a vector clock, even though they are not quite the same. The difference is subtle—please see the references for details [57, 60, 61]. In brief, when comparing the state of replicas, version vectors are the right data structure to use.

总结

Summary

在本章中,我们研究了复制问题。复制可以用于多种目的:

In this chapter we looked at the issue of replication. Replication can serve several purposes:

高可用性
High availability

即使一台机器(或多台机器,或整个数据中心)出现故障,也能保持系统运行

Keeping the system running, even when one machine (or several machines, or an entire datacenter) goes down

断线操作
Disconnected operation

允许应用程序在网络中断时继续工作

Allowing an application to continue working when there is a network interruption

延迟
Latency

将数据放置在靠近用户的地理位置,以便用户可以更快地与之交互

Placing data geographically close to users, so that users can interact with it faster

可扩展性
Scalability

通过在副本上执行读取,能够处理比单台机器可以处理的更大的读取量

Being able to handle a higher volume of reads than a single machine could handle, by performing reads on replicas

尽管目标很简单(在多台机器上保留相同数据的副本),但复制却是一个非常棘手的问题。它需要仔细考虑并发性和所有可能出错的事情,并处理这些错误的后果。至少,我们需要处理不可用的节点和网络中断(这甚至没有考虑更隐蔽的故障类型,例如由于软件错误导致的静默数据损坏)。

Despite being a simple goal—keeping a copy of the same data on several machines—replication turns out to be a remarkably tricky problem. It requires carefully thinking about concurrency and about all the things that can go wrong, and dealing with the consequences of those faults. At a minimum, we need to deal with unavailable nodes and network interruptions (and that’s not even considering the more insidious kinds of fault, such as silent data corruption due to software bugs).

我们讨论了三种主要的复制方法:

We discussed three main approaches to replication:

单领导者复制
Single-leader replication

客户端将所有写入发送到单个节点(领导者),该节点将数据更改事件流发送到其他副本(追随者)。可以在任何副本上执行读取,但从追随者读取的数据可能已过时。

Clients send all writes to a single node (the leader), which sends a stream of data change events to the other replicas (followers). Reads can be performed on any replica, but reads from followers might be stale.

多领导者复制
Multi-leader replication

客户端将每个写入发送到多个领导节点之一,其中任何一个都可以接受写入。领导者向彼此以及任何追随者节点发送数据更改事件流。

Clients send each write to one of several leader nodes, any of which can accept writes. The leaders send streams of data change events to each other and to any follower nodes.

无领导者复制
Leaderless replication

客户端将每次写入发送到多个节点,并并行从多个节点读取,以便检测和纠正具有陈旧数据的节点。

Clients send each write to several nodes, and read from several nodes in parallel in order to detect and correct nodes with stale data.

每种方法都有优点和缺点。单领导者复制很受欢迎,因为它相当容易理解,并且无需担心冲突解决。在存在故障节点、网络中断和延迟峰值的情况下,多领导者和无领导者复制可以更加稳健,但代价是更难以推理并且仅提供非常弱的一致性保证。

Each approach has advantages and disadvantages. Single-leader replication is popular because it is fairly easy to understand and there is no conflict resolution to worry about. Multi-leader and leaderless replication can be more robust in the presence of faulty nodes, network interruptions, and latency spikes—at the cost of being harder to reason about and providing only very weak consistency guarantees.

复制可以是同步的,也可以是异步的,这对发生故障时的系统行为有着深远的影响。尽管在系统平稳运行时异步复制可以很快,但重要的是要弄清楚当复制延迟增加并且服务器发生故障时会发生什么。如果领导者失败并且您将异步更新的追随者提升为新的领导者,则最近提交的数据可能会丢失。

Replication can be synchronous or asynchronous, which has a profound effect on the system behavior when there is a fault. Although asynchronous replication can be fast when the system is running smoothly, it’s important to figure out what happens when replication lag increases and servers fail. If a leader fails and you promote an asynchronously updated follower to be the new leader, recently committed data may be lost.

我们研究了复制延迟可能导致的一些奇怪的影响,并讨论了一些一致性模型,这些模型有助于确定应用程序在复制延迟下应如何表现:

We looked at some strange effects that can be caused by replication lag, and we discussed a few consistency models which are helpful for deciding how an application should behave under replication lag:

写后读一致性
Read-after-write consistency

用户应该始终看到他们自己提交的数据。

Users should always see data that they submitted themselves.

单调读取
Monotonic reads

用户查看某一时间点的数据后,他们不应再看到某个较早时间点的数据。

After users have seen the data at one point in time, they shouldn’t later see the data from some earlier point in time.

一致的前缀读取
Consistent prefix reads

用户应该以具有因果意义的状态查看数据:例如,以正确的顺序查看问题及其答复。

Users should see the data in a state that makes causal sense: for example, seeing a question and its reply in the correct order.

最后,我们讨论了多领导者和无领导者复制方法固有的并发问题:因为它们允许同时发生多个写入,所以可能会发生冲突。我们研究了一种算法,数据库可以使用该算法来确定一个操作是否发生在另一个操作之前,或者它们是否同时发生。我们还讨论了通过合并并发更新来解决冲突的方法。

Finally, we discussed the concurrency issues that are inherent in multi-leader and leaderless replication approaches: because they allow multiple writes to happen concurrently, conflicts may occur. We examined an algorithm that a database might use to determine whether one operation happened before another, or whether they happened concurrently. We also touched on methods for resolving conflicts by merging together concurrent updates.

在下一章中,我们将通过与复制相对应的另一种手段,继续研究分布在多台机器上的数据:将大型数据集拆分为多个分区。

In the next chapter we will continue looking at data that is distributed across multiple machines, through the counterpart of replication: splitting a large dataset into partitions.

脚注

i 不同的人对热备(hot standby)、温备(warm standby)和冷备(cold standby)服务器有不同的定义。例如,在 PostgreSQL 中,热备指的是接受来自客户端读取请求的副本,而温备则处理来自领导者的变更,但不处理来自客户端的任何查询。就本书而言,这一差异并不重要。

i Different people have different definitions for hot, warm, and cold standby servers. In PostgreSQL, for example, hot standby is used to refer to a replica that accepts reads from clients, whereas a warm standby processes changes from the leader but doesn’t process any queries from clients. For purposes of this book, the difference isn’t important.

ii 这种方法被称为 fencing,或者更强调地说,“Shoot The Other Node In The Head (STONITH)”。我们将在“领导者和锁”中更详细地讨论 fencing。

ii This approach is known as fencing or, more emphatically, Shoot The Other Node In The Head (STONITH). We will discuss fencing in more detail in “The leader and the lock”.

iii最终一致性一词由 Douglas Terry 等人创造。[ 24 ],由 Werner Vogels [ 22 ] 推广,并成为许多 NoSQL 项目的战斗口号。然而,最终一致的不仅是 NoSQL 数据库:异步复制关系数据库中的追随者也具有相同的特征。

iii The term eventual consistency was coined by Douglas Terry et al. [24], popularized by Werner Vogels [22], and became the battle cry of many NoSQL projects. However, not only NoSQL databases are eventually consistent: followers in an asynchronously replicated relational database have the same characteristics.

iv如果数据库是分区的(参见 第 6 章),则每个分区都有一个领导者。不同分区的领导者可以位于不同的节点上,但每个分区必须有一个领导者节点。

iv If the database is partitioned (see Chapter 6), each partition has one leader. Different partitions may have their leaders on different nodes, but each partition must nevertheless have one leader node.

v不要与 星型模式混淆(请参阅“星型和雪花:分析模式”),星型模式描述数据模型的结构,而不是节点之间的通信拓扑。

v Not to be confused with a star schema (see “Stars and Snowflakes: Schemas for Analytics”), which describes the structure of a data model, not the communication topology between nodes.

vi Dynamo 不适用于 Amazon 以外的用户。令人困惑的是,AWS 提供了一个名为DynamoDB的托管数据库产品,它使用完全不同的架构:它基于单主复制。

vi Dynamo is not available to users outside of Amazon. Confusingly, AWS offers a hosted database product called DynamoDB, which uses a completely different architecture: it is based on single-leader replication.

vii 有时这种法定人数被称为严格法定人数(strict quorum),以与宽松法定人数(sloppy quorum)形成对比(在“宽松的法定人数与提示移交”中讨论)。

vii Sometimes this kind of quorum is called a strict quorum, to contrast with sloppy quorums (discussed in “Sloppy Quorums and Hinted Handoff”).

参考

[ 1 ] Bruce G. Lindsay、Patricia Griffiths Selinger、C. Galtieri 等人:“分布式数据库注释”,IBM Research,研究报告 RJ2571(33471),1979 年 7 月。

[1] Bruce G. Lindsay, Patricia Griffiths Selinger, C. Galtieri, et al.: “Notes on Distributed Databases,” IBM Research, Research Report RJ2571(33471), July 1979.

[ 2 ]“ Oracle Active Data Guard 实时数据保护和可用性”,Oracle 白皮书,2013 年 6 月。

[2] “Oracle Active Data Guard Real-Time Data Protection and Availability,” Oracle White Paper, June 2013.

[ 3 ]“ AlwaysOn 可用性组”,SQL Server 联机丛书,Microsoft,2012 年。

[3] “AlwaysOn Availability Groups,” in SQL Server Books Online, Microsoft, 2012.

[ 4 ] Lin Qiao、Kapil Surlaker、Shirshanka Das 等人:“ On Brewing Fresh Espresso:LinkedIn 的分布式数据服务平台”,ACM 国际数据管理会议(SIGMOD),2013 年 6 月。

[4] Lin Qiao, Kapil Surlaker, Shirshanka Das, et al.: “On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform,” at ACM International Conference on Management of Data (SIGMOD), June 2013.

[ 5 ] Jun Rao:“ Apache Kafka 的集群内复制”,ApacheCon 北美,2013 年 2 月。

[5] Jun Rao: “Intra-Cluster Replication for Apache Kafka,” at ApacheCon North America, February 2013.

[ 6 ]“高可用队列”,参见RabbitMQ 服务器文档,Pivotal Software, Inc.,2014 年。

[6] “Highly Available Queues,” in RabbitMQ Server Documentation, Pivotal Software, Inc., 2014.

[ 7 ] Yoshinori Matsunobu:“ Facebook 的半同步复制”,yoshinorimatsunobu.blogspot.co.uk,2014 年 4 月 1 日。

[7] Yoshinori Matsunobu: “Semi-Synchronous Replication at Facebook,” yoshinorimatsunobu.blogspot.co.uk, April 1, 2014.

[ 8 ] Robbert van Renesse 和 Fred B. Schneider:“支持高吞吐量和可用性的链复制”,第六届 USENIX 操作系统设计和实现(OSDI) 研讨会,2004 年 12 月。

[8] Robbert van Renesse and Fred B. Schneider: “Chain Replication for Supporting High Throughput and Availability,” at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.

[ 9 ] Jeff Terrace 和 Michael J. Freedman:“ CRAQ 上的对象存储:用于读取为主的工作负载的高吞吐量链式复制”,USENIX 年度技术会议(ATC),2009 年 6 月。

[9] Jeff Terrace and Michael J. Freedman: “Object Storage on CRAQ: High-Throughput Chain Replication for Read-Mostly Workloads,” at USENIX Annual Technical Conference (ATC), June 2009.

[ 10 ] Brad Calder、Ju Wang、Aaron Ogus 等人:“ Windows Azure 存储:具有强一致性的高可用云存储服务”,第 23 届 ACM 操作系统原理研讨会(SOSP),2011 年 10 月。

[10] Brad Calder, Ju Wang, Aaron Ogus, et al.: “Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency,” at 23rd ACM Symposium on Operating Systems Principles (SOSP), October 2011.

[ 11 ] Andrew Wang:“ Windows Azure 存储”, umbrant.com,2016 年 2 月 4 日。

[11] Andrew Wang: “Windows Azure Storage,” umbrant.com, February 4, 2016.

[ 12 ]“ Percona Xtrabackup - 文档”,Percona LLC,2014 年。

[12] “Percona Xtrabackup - Documentation,” Percona LLC, 2014.

[ 13 ] Jesse Newland:“本周 GitHub 可用性”,github.com,2012 年 9 月 14 日。

[13] Jesse Newland: “GitHub Availability This Week,” github.com, September 14, 2012.

[ 14 ]Mark Imbriaco:“上周六停机”, github.com,2012 年 12 月 26 日。

[14] Mark Imbriaco: “Downtime Last Saturday,” github.com, December 26, 2012.

[ 15 ] John Hugg:“分布式系统性能和测试的确定性‘全力以赴’ ”,Strange Loop,2015 年 9 月。

[15] John Hugg: “‘All in’ with Determinism for Performance and Testing in Distributed Systems,” at Strange Loop, September 2015.

[ 16 ] Amit Kapila:“ PostgreSQL 的 WAL 内部结构”,PostgreSQL 会议(PGCon),2012 年 5 月。

[16] Amit Kapila: “WAL Internals of PostgreSQL,” at PostgreSQL Conference (PGCon), May 2012.

[ 17 ] MySQL 内部手册。甲骨文,2014 年。

[17] MySQL Internals Manual. Oracle, 2014.

[ 18 ] Yogeshwer Sharma、Philippe Ajoux、Petchean Ang 等人:“ Wormhole:支持地理复制互联网服务的可靠 Pub-Sub ”,第12 届 USENIX 网络系统设计与实现(NSDI) 研讨会,2015 年 5 月。

[18] Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services,” at 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015.

[ 19 ]“ Oracle GoldenGate 12c:实时访问实时信息”,Oracle 白皮书,2013 年 10 月。

[19] “Oracle GoldenGate 12c: Real-Time Access to Real-Time Information,” Oracle White Paper, October 2013.

[ 20 ] Shirshanka Das、Chavdar Botev、Kapil Surlaker 等人:“全部登上数据总线!”, 2012 年 10 月ACM 云计算研讨会(SoCC)。

[20] Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at ACM Symposium on Cloud Computing (SoCC), October 2012.

[ 21 ] Greg Sabino Mullane:“ Bucardo 数据库复制系统第 5 版”,blog.endpoint.com,2014 年 6 月 23 日。

[21] Greg Sabino Mullane: “Version 5 of Bucardo Database Replication System,” blog.endpoint.com, June 23, 2014.

[ 22 ] Werner Vogels:“最终一致”, ACM Queue,第 6 卷,第 6 号,第 14-19 页,2008 年 10 月 。doi:10.1145/1466443.1466448

[22] Werner Vogels: “Eventually Consistent,” ACM Queue, volume 6, number 6, pages 14–19, October 2008. doi:10.1145/1466443.1466448

[ 23 ] Douglas B. Terry:“通过棒球解释复制数据一致性”,微软研究院,技术报告 MSR-TR-2011-137,2011 年 10 月。

[23] Douglas B. Terry: “Replicated Data Consistency Explained Through Baseball,” Microsoft Research, Technical Report MSR-TR-2011-137, October 2011.

[ 24 ]Douglas B. Terry、Alan J. Demers、Karin Petersen 等人:“弱一致性复制数据的会话保证”,第 3 届并行和分布式信息系统国际会议(PDIS),1994 年 9 月 。doi:10.1109 /PDIS.1994.331722

[24] Douglas B. Terry, Alan J. Demers, Karin Petersen, et al.: “Session Guarantees for Weakly Consistent Replicated Data,” at 3rd International Conference on Parallel and Distributed Information Systems (PDIS), September 1994. doi:10.1109/PDIS.1994.331722

[ 25 ] 特里·普拉切特:收割者:碟形世界小说。维克多·戈兰茨,1991。ISBN:978-0-575-04979-6

[25] Terry Pratchett: Reaper Man: A Discworld Novel. Victor Gollancz, 1991. ISBN: 978-0-575-04979-6

[ 26 ]“Tungsten Replicator”,Continuent, Inc.,2014。

[26] “Tungsten Replicator,” Continuent, Inc., 2014.

[ 27 ]“ BDR 0.10.0 文档”,PostgreSQL 全球开发小组,bdr-project.org,2015 年。

[27] “BDR 0.10.0 Documentation,” The PostgreSQL Global Development Group, bdr-project.org, 2015.

[ 28 ] Robert Hodges:“如果您*必须*部署多主复制,请先阅读本文”,scale-out-blog.blogspot.co.uk,2012 年 3 月 30 日。

[28] Robert Hodges: “If You *Must* Deploy Multi-Master Replication, Read This First,” scale-out-blog.blogspot.co.uk, March 30, 2012.

[ 29 ] J. Chris Anderson、Jan Lehnardt 和 Noah Slater:CouchDB:权威指南。奥莱利媒体,2010。ISBN:978-0-596-15589-6

[29] J. Chris Anderson, Jan Lehnardt, and Noah Slater: CouchDB: The Definitive Guide. O’Reilly Media, 2010. ISBN: 978-0-596-15589-6

[ 30 ] AppJet, Inc.:“Etherpad和 EasySync 技术手册”,github.com,2011 年 3 月 26 日。

[30] AppJet, Inc.: “Etherpad and EasySync Technical Manual,” github.com, March 26, 2011.

[ 31 ] John Day-Richter:“新 Google 文档有何不同:加快协作速度”,googledrive.blogspot.com,2010 年 9 月 23 日。

[31] John Day-Richter: “What’s Different About the New Google Docs: Making Collaboration Fast,” googledrive.blogspot.com, 23 September 2010.

[ 32 ] Martin Kleppmann 和 Alastair R. Beresford:“无冲突复制 JSON 数据类型”,arXiv:1608.03960,2016 年 8 月 13 日。

[32] Martin Kleppmann and Alastair R. Beresford: “A Conflict-Free Replicated JSON Datatype,” arXiv:1608.03960, August 13, 2016.

[ 33 ] Frazer Clement:“最终一致性 – 检测冲突”,messagepassing.blogspot.co.uk,2011 年 10 月 20 日。

[33] Frazer Clement: “Eventual Consistency – Detecting Conflicts,” messagepassing.blogspot.co.uk, October 20, 2011.

[ 34 ] Robert Hodges:“ MySQL 多主复制的最新技术”,Percona Live:MySQL 会议暨博览会,2013 年 4 月。

[34] Robert Hodges: “State of the Art for MySQL Multi-Master Replication,” at Percona Live: MySQL Conference & Expo, April 2013.

[ 35 ] John Daily:“时钟很糟糕,或者,欢迎来到分布式系统的奇妙世界”,basho.com,2013 年 11 月 12 日。

[35] John Daily: “Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems,” basho.com, November 12, 2013.

[ 36 ] Riley Berton:“ Postgres 中的双向复制 (BDR) 是事务性的吗?”,sdf.org,2016 年 1 月 4 日。

[36] Riley Berton: “Is Bi-Directional Replication (BDR) in Postgres Transactional?,” sdf.org, January 4, 2016.

[ 37 ] Giuseppe DeCandia、Deniz Hastorun、Madan Jampani 等人:“ Dynamo:Amazon 的高可用性键值存储”,第 21 届 ACM 操作系统原则研讨会(SOSP),2007 年 10 月。

[37] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “Dynamo: Amazon’s Highly Available Key-Value Store,” at 21st ACM Symposium on Operating Systems Principles (SOSP), October 2007.

[ 38 ] Marc Shapiro、Nuno Preguiça、Carlos Baquero 和 Marek Zawirski:“收敛和交换复制数据类型的综合研究”,INRIA 研究报告第 7506 号,2011 年 1 月。

[38] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski: “A Comprehensive Study of Convergent and Commutative Replicated Data Types,” INRIA Research Report no. 7506, January 2011.

[ 39 ] Sam Elliott:“ CRDT:更新(或可能只是 PUT) ”,RICON West,2013 年 10 月。

[39] Sam Elliott: “CRDTs: An UPDATE (or Maybe Just a PUT),” at RICON West, October 2013.

[ 40 ] Russell Brown:“ Riak 中的 CRDT 虚张声势指南”,gist.github.com,2013 年 10 月 28 日。

[40] Russell Brown: “A Bluffers Guide to CRDTs in Riak,” gist.github.com, October 28, 2013.

[ 41 ] Benjamin Farinier、Thomas Gazagnaire 和 Anil Madhavapeddy:“可合并持久数据结构”,第26es Journées Francophones des Langages Applicatifs (JFLA),2015 年 1 月。

[41] Benjamin Farinier, Thomas Gazagnaire, and Anil Madhavapeddy: “Mergeable Persistent Data Structures,” at 26es Journées Francophones des Langages Applicatifs (JFLA), January 2015.

[ 42 ] Chengzheng Sun 和 Clarence Ellis:“实时组编辑的操作转型:问题、算法和成就”, ACM 计算机支持协作工作会议(CSCW),1998 年 11 月。

[42] Chengzheng Sun and Clarence Ellis: “Operational Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements,” at ACM Conference on Computer Supported Cooperative Work (CSCW), November 1998.

[ 43 ] Lars Hofhansl:“ HBASE-7709:主/主复制中可能出现无限循环”,issues.apache.org,2013 年 1 月 29 日。

[43] Lars Hofhansl: “HBASE-7709: Infinite Loop Possible in Master/Master Replication,” issues.apache.org, January 29, 2013.

[ 44 ] David K. Gifford:“复制数据的加权投票”,第 7 届 ACM 操作系统原理研讨会(SOSP),1979 年 12 月 。doi:10.1145/800215.806583

[44] David K. Gifford: “Weighted Voting for Replicated Data,” at 7th ACM Symposium on Operating Systems Principles (SOSP), December 1979. doi:10.1145/800215.806583

[ 45 ] Heidi Howard、Dahlia Malkhi 和 Alexander Spiegelman:“灵活的 Paxos:重新审视 Quorum Intersection ”, arXiv:1608.06696,2016年 8 月 24 日。

[45] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman: “Flexible Paxos: Quorum Intersection Revisited,” arXiv:1608.06696, August 24, 2016.

[ 46 ] Joseph Blomstedt:“回复:绝对一致性”,发送给riak-users邮件列表的电子邮件,lists.basho.com,2012 年 1 月 11 日。

[46] Joseph Blomstedt: “Re: Absolute Consistency,” email to riak-users mailing list, lists.basho.com, January 11, 2012.

[ 47 ] Joseph Blomstedt:“为 Riak 带来一致性”,RICON West,2012 年 10 月。

[47] Joseph Blomstedt: “Bringing Consistency to Riak,” at RICON West, October 2012.

[ 48 ] Peter Bailis、Shivaram Venkataraman、Michael J. Franklin 等人:“用 PBS 量化最终一致性”, Communications of the ACM,第 57 卷,第 8 期,第 93-102 页,2014 年 8 月 。doi:10.1145/2632792

[48] Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, et al.: “Quantifying Eventual Consistency with PBS,” Communications of the ACM, volume 57, number 8, pages 93–102, August 2014. doi:10.1145/2632792

[ 49 ] Jonathan Ellis:“现代提示切换”, datastax.com,2012 年 12 月 11 日。

[49] Jonathan Ellis: “Modern Hinted Handoff,” datastax.com, December 11, 2012.

[ 50 ]“Project Voldemort Wiki”,github.com,2013。

[50] “Project Voldemort Wiki,” github.com, 2013.

[ 51 ]“ Apache Cassandra 2.0 文档”,DataStax, Inc.,2014 年。

[51] “Apache Cassandra 2.0 Documentation,” DataStax, Inc., 2014.

[ 52 ]“ Riak Enterprise:多数据中心复制。” 技术白皮书,Basho Technologies, Inc.,2014 年 9 月。

[52] “Riak Enterprise: Multi-Datacenter Replication.” Technical whitepaper, Basho Technologies, Inc., September 2014.

[ 53 ] Jonathan Ellis:“为什么 Cassandra 不需要矢量时钟”,datastax.com,2013 年 9 月 2 日。

[53] Jonathan Ellis: “Why Cassandra Doesn’t Need Vector Clocks,” datastax.com, September 2, 2013.

[ 54 ] Leslie Lamport:“分布式系统中的时间、时钟和事件排序”,ACM 通讯,第 21 卷,第 7 期,第 558–565 页,1978 年 7 月 。doi:10.1145/359545.359563

[54] Leslie Lamport: “Time, Clocks, and the Ordering of Events in a Distributed System,” Communications of the ACM, volume 21, number 7, pages 558–565, July 1978. doi:10.1145/359545.359563

[ 55 ] Joel Jacobson:“ Riak 2.0:数据类型”, blog.joeljacobson.com,2014 年 3 月 23 日。

[55] Joel Jacobson: “Riak 2.0: Data Types,” blog.joeljacobson.com, March 23, 2014.

[ 56 ] D. Stott Parker Jr.、Gerald J. Popek、Gerard Rudisin 等人:“分布式系统中相互不一致的检测”,IEEE 软件工程汇刊,第 9 卷,第 3 期,第 240–247 页,1983 年 5 月。doi:10.1109/TSE.1983.236733

[56] D. Stott Parker Jr., Gerald J. Popek, Gerard Rudisin, et al.: “Detection of Mutual Inconsistency in Distributed Systems,” IEEE Transactions on Software Engineering, volume 9, number 3, pages 240–247, May 1983. doi:10.1109/TSE.1983.236733

[ 57 ] Nuno Preguiça、Carlos Baquero、Paulo Sérgio Almeida 等人:“点状版本向量:用于乐观复制的逻辑时钟”,arXiv:1011.5808,2010 年 11 月 26 日。

[57] Nuno Preguiça, Carlos Baquero, Paulo Sérgio Almeida, et al.: “Dotted Version Vectors: Logical Clocks for Optimistic Replication,” arXiv:1011.5808, November 26, 2010.

[ 58 ] Sean Cribbs:“ Riak 时间简史”,RICON,2014 年 10 月。

[58] Sean Cribbs: “A Brief History of Time in Riak,” at RICON, October 2014.

[ 59 ] Russell Brown:“矢量时钟重温第 2 部分:点状版本矢量”,basho.com,2015 年 11 月 10 日。

[59] Russell Brown: “Vector Clocks Revisited Part 2: Dotted Version Vectors,” basho.com, November 10, 2015.

[ 60 ] Carlos Baquero:“版本向量不是向量时钟”,haslab.wordpress.com,2011 年 7 月 8 日。

[60] Carlos Baquero: “Version Vectors Are Not Vector Clocks,” haslab.wordpress.com, July 8, 2011.

[ 61 ] Reinhard Schwarz 和 Friedemann Mattern:“检测分布式计算中的因果关系:寻找圣杯”,分布式计算,第 7 卷,第 3 期,第 149–174 页,1994 年 3 月 。doi:10.1007/BF02277859

[61] Reinhard Schwarz and Friedemann Mattern: “Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail,” Distributed Computing, volume 7, number 3, pages 149–174, March 1994. doi:10.1007/BF02277859

第 6 章分区

Chapter 6. Partitioning

显然,我们必须摆脱顺序而不是限制计算机。我们必须陈述定义并提供数据的优先级和描述。我们必须陈述关系,而不是程序。

格蕾丝·默里·霍珀 (Grace Murray Hopper),《管理与未来的计算机》(1962)

Clearly, we must break away from the sequential and not limit the computers. We must state definitions and provide for priorities and descriptions of data. We must state relationships, not procedures.

Grace Murray Hopper, Management and the Computer of the Future (1962)

第 5 章中,我们讨论了复制,即在不同节点上拥有相同数据的多个副本。对于非常大的数据集,或者非常高的查询吞吐量,这是不够的:我们需要将数据分成分区,也称为 分片

In Chapter 5 we discussed replication—that is, having multiple copies of the same data on different nodes. For very large datasets, or very high query throughput, that is not sufficient: we need to break the data up into partitions, also known as sharding.i

术语混乱

Terminological confusion

我们这里所说的分区(partition),在 MongoDB、Elasticsearch 和 SolrCloud 中称为分片(shard);在 HBase 中称为区域(region),在 Bigtable 中称为 tablet,在 Cassandra 和 Riak 中称为 vnode,在 Couchbase 中称为 vBucket。然而,分区是最常用的术语,因此我们将坚持使用它。

What we call a partition here is called a shard in MongoDB, Elasticsearch, and SolrCloud; it’s known as a region in HBase, a tablet in Bigtable, a vnode in Cassandra and Riak, and a vBucket in Couchbase. However, partitioning is the most established term, so we’ll stick with that.

通常,分区的定义方式是每条数据(每条记录、行或文档)都属于一个分区。实现这一目标的方法有多种,我们将在本章中深入讨论。实际上,每个分区都是它自己的一个小型数据库,尽管该数据库可能支持同时涉及多个分区的操作。

Normally, partitions are defined in such a way that each piece of data (each record, row, or document) belongs to exactly one partition. There are various ways of achieving this, which we discuss in depth in this chapter. In effect, each partition is a small database of its own, although the database may support operations that touch multiple partitions at the same time.
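例如,一种常见的做法是按键的哈希来决定一条记录属于哪个分区(假设性草图,并非任何特定数据库的实际方案):

For example, one common way to assign each record to exactly one partition is by hashing its key (a hypothetical sketch, not the scheme of any particular database):

```python
import hashlib

def partition_for_key(key, num_partitions):
    """Deterministically assign a record to exactly one partition by
    hashing its key and taking the hash modulo the partition count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions
```

由于哈希是确定性的,同一个键总是落在同一个分区;不同的键则大致均匀地分布到所有分区上。

Because the hash is deterministic, the same key always lands in the same partition, while different keys spread roughly evenly across all partitions.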

想要对数据进行分区的主要原因是可伸缩性。不同的分区可以放置在无共享集群中的不同节点上(有关无共享 的定义,请参阅第二部分的介绍)。因此,大型数据集可以分布在许多磁盘上,并且查询负载可以分布在许多处理器上。

The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster (see the introduction to Part II for a definition of shared nothing). Thus, a large dataset can be distributed across many disks, and the query load can be distributed across many processors.

对于在单个分区上操作的查询,每个节点都可以独立执行其自己分区的查询,因此可以通过添加更多节点来扩展查询吞吐量。大型、复杂的查询可能会跨多个节点并行化,尽管这会变得非常困难。

For queries that operate on a single partition, each node can independently execute the queries for its own partition, so query throughput can be scaled by adding more nodes. Large, complex queries can potentially be parallelized across many nodes, although this gets significantly harder.

分区数据库在 20 世纪 80 年代由 Teradata 和 Tandem NonStop SQL [ 1 ] 等产品首创,最近又被 NoSQL 数据库和基于 Hadoop 的数据仓库重新发现。有些系统是为事务性工作负载而设计的,而另一些系统是为了分析而设计的(请参阅“事务处理还是分析?”):这种差异会影响系统的调整方式,但分区的基本原理适用于这两种工作负载。

Partitioned databases were pioneered in the 1980s by products such as Teradata and Tandem NonStop SQL [1], and more recently rediscovered by NoSQL databases and Hadoop-based data warehouses. Some systems are designed for transactional workloads, and others for analytics (see “Transaction Processing or Analytics?”): this difference affects how the system is tuned, but the fundamentals of partitioning apply to both kinds of workloads.

在本章中,我们将首先了解对大型数据集进行分区的不同方法,并观察数据索引如何与分区交互。然后我们将讨论重新平衡,如果您想在集群中添加或删除节点,这是必要的。最后,我们将概述数据库如何将请求路由到正确的分区并执行查询。

In this chapter we will first look at different approaches for partitioning large datasets and observe how the indexing of data interacts with partitioning. We’ll then talk about rebalancing, which is necessary if you want to add or remove nodes in your cluster. Finally, we’ll get an overview of how databases route requests to the right partitions and execute queries.

分区和复制

Partitioning and Replication

分区通常与复制相结合,以便每个分区的副本存储在多个节点上。这意味着,即使每条记录恰好属于一个分区,它仍然可能存储在多个不同的节点上以实现容错。

Partitioning is usually combined with replication so that copies of each partition are stored on multiple nodes. This means that, even though each record belongs to exactly one partition, it may still be stored on several different nodes for fault tolerance.

一个节点可以存储多个分区。如果使用领导者-跟随者复制模型,分区和复制的组合可能如图6-1所示。每个分区的领导者被分配到一个节点,而其追随者被分配到其他节点。每个节点可能是某些分区的领导者和其他分区的跟随者。

A node may store more than one partition. If a leader–follower replication model is used, the combination of partitioning and replication can look like Figure 6-1. Each partition’s leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the leader for some partitions and a follower for other partitions.

我们在第 5 章中讨论的有关数据库复制的所有内容同样适用于分区复制。分区方案的选择大多与复制方案的选择无关,因此本章中我们将保持简单并忽略复制。

Everything we discussed in Chapter 5 about replication of databases applies equally to replication of partitions. The choice of partitioning scheme is mostly independent of the choice of replication scheme, so we will keep things simple and ignore replication in this chapter.

图 6-1。结合复制和分区:每个节点充当某些分区的领导者和其他分区的跟随者。

键值数据的分区

Partitioning of Key-Value Data

假设您有大量数据,并且想要对其进行分区。您如何决定将哪些记录存储在哪些节点上?

Say you have a large amount of data, and you want to partition it. How do you decide which records to store on which nodes?

我们分区的目标是在节点之间均匀分布数据和查询负载。如果每个节点都公平共享,那么理论上,10 个节点应该能够处理单个节点 10 倍的数据量和 10 倍的读写吞吐量(暂时忽略复制)。

Our goal with partitioning is to spread the data and the query load evenly across nodes. If every node takes a fair share, then—in theory—10 nodes should be able to handle 10 times as much data and 10 times the read and write throughput of a single node (ignoring replication for now).

如果分区不公平,导致某些分区比其他分区拥有更多的数据或查询,我们称之为倾斜。倾斜的存在使得分区的效率大大降低。在一种极端情况下,所有负载可能最终都集中在一个分区上,因此 10 个节点中有 9 个处于空闲状态,而瓶颈就是单个繁忙节点。负载过高的分区称为热点

If the partitioning is unfair, so that some partitions have more data or queries than others, we call it skewed. The presence of skew makes partitioning much less effective. In an extreme case, all the load could end up on one partition, so 9 out of 10 nodes are idle and your bottleneck is the single busy node. A partition with disproportionately high load is called a hot spot.

避免热点的最简单方法是将记录随机分配给节点。这会将数据相当均匀地分布在节点上,但它有一个很大的缺点:当您尝试读取特定项目时,您无法知道它位于哪个节点上,因此您必须并行查询所有节点。

The simplest approach for avoiding hot spots would be to assign records to nodes randomly. That would distribute the data quite evenly across the nodes, but it has a big disadvantage: when you’re trying to read a particular item, you have no way of knowing which node it is on, so you have to query all nodes in parallel.

我们可以做得更好。现在假设您有一个简单的键值数据模型,在该模型中您始终通过主键访问记录。例如,在老式的纸质百科全书中,您可以通过标题查找条目;由于所有条目均按标题字母顺序排序,因此您可以快速找到所需的条目。

We can do better. Let’s assume for now that you have a simple key-value data model, in which you always access a record by its primary key. For example, in an old-fashioned paper encyclopedia, you look up an entry by its title; since all the entries are alphabetically sorted by title, you can quickly find the one you’re looking for.

按键范围分区

Partitioning by Key Range

分区的一种方法是为每个分区分配连续范围的键(从最小到最大),就像纸质百科全书的卷一样(图6-2)。如果您知道范围之间的边界,则可以轻松确定哪个分区包含给定键。如果您还知道哪个分区分配给哪个节点,那么您可以直接向适当的节点发出请求(或者,对于百科全书,从书架上挑选正确的书)。

One way of partitioning is to assign a continuous range of keys (from some minimum to some maximum) to each partition, like the volumes of a paper encyclopedia (Figure 6-2). If you know the boundaries between the ranges, you can easily determine which partition contains a given key. If you also know which partition is assigned to which node, then you can make your request directly to the appropriate node (or, in the case of the encyclopedia, pick the correct book off the shelf).
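上述查找过程可以用一小段 Python 草图来示意(这里的边界 BOUNDARIES 纯属虚构):分区边界已排序,因此可以用二分查找确定键所属的分区。

The lookup described above can be sketched in a few lines of Python (the boundaries in BOUNDARIES are made up for the example): since the partition boundaries are sorted, a binary search finds the partition for any key.

```python
import bisect

# 假设的分区边界(不含上界):键小于 "b" 的进分区 0,依此类推。
# 实际系统中边界由管理员或数据库根据数据分布选择。
BOUNDARIES = ["b", "g", "n", "t"]  # 5 个分区:[..b) [b..g) [g..n) [n..t) [t..]

def partition_for_key(key: str) -> int:
    """用二分查找确定给定键落在哪个键范围分区。"""
    return bisect.bisect_right(BOUNDARIES, key)

# 相邻的键落在同一分区,因此范围扫描只需访问少数几个分区
assert partition_for_key("apple") == 0
assert partition_for_key("grape") == 2
assert partition_for_key("zebra") == 4
```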

图 6-2。印刷版百科全书按键范围进行分区。

键的范围不一定均匀分布,因为您的数据可能本身就不均匀。例如,在图 6-2 中,第 1 卷包含以 A 和 B 开头的单词,而第 12 卷包含以 T、U、V、X、Y 和 Z 开头的单词。如果简单地让字母表中每两个字母占一卷,某些卷就会比其他卷大得多。为了均匀地分布数据,分区边界需要适应数据本身。

The ranges of keys are not necessarily evenly spaced, because your data may not be evenly distributed. For example, in Figure 6-2, volume 1 contains words starting with A and B, but volume 12 contains words starting with T, U, V, X, Y, and Z. Simply having one volume per two letters of the alphabet would lead to some volumes being much bigger than others. In order to distribute the data evenly, the partition boundaries need to adapt to the data.

分区边界可以由管理员手动选择,也可以由数据库自动选择(我们将在“重新平衡分区”中更详细地讨论分区边界的选择)。Bigtable、它的开源等价物 HBase [2, 3]、RethinkDB 以及 2.4 版之前的 MongoDB [4] 都使用这种分区策略。

The partition boundaries might be chosen manually by an administrator, or the database can choose them automatically (we will discuss choices of partition boundaries in more detail in “Rebalancing Partitions”). This partitioning strategy is used by Bigtable, its open source equivalent HBase [2, 3], RethinkDB, and MongoDB before version 2.4 [4].

在每个分区中,我们可以按排序顺序保存键(请参阅“SSTables 和 LSM-Trees”)。这样做的优点是范围扫描很容易,并且您可以将键视为串联索引,以便在一个查询中获取多个相关记录(请参阅“多列索引”)。例如,考虑一个存储传感器网络数据的应用程序,其中键是测量的时间戳(年-月-日-时-分-秒)。在这种情况下,范围扫描非常有用,因为它们可以让您轻松获取例如特定月份的所有读数。

Within each partition, we can keep keys in sorted order (see “SSTables and LSM-Trees”). This has the advantage that range scans are easy, and you can treat the key as a concatenated index in order to fetch several related records in one query (see “Multi-column indexes”). For example, consider an application that stores data from a network of sensors, where the key is the timestamp of the measurement (year-month-day-hour-minute-second). Range scans are very useful in this case, because they let you easily fetch, say, all the readings from a particular month.

然而,键范围分区的缺点是某些访问模式可能会导致热点。如果键是时间戳,则分区对应于时间范围,例如每天一个分区。不幸的是,因为我们在测量时将数据从传感器写入数据库,所以所有写入最终都会进入同一个分区(今天的分区),因此该分区可能因写入而过载,而其他分区则闲置 [5 ]

However, the downside of key range partitioning is that certain access patterns can lead to hot spots. If the key is a timestamp, then the partitions correspond to ranges of time—e.g., one partition per day. Unfortunately, because we write data from the sensors to the database as the measurements happen, all the writes end up going to the same partition (the one for today), so that partition can be overloaded with writes while others sit idle [5].

为了避免传感器数据库中出现此问题,您需要使用时间戳以外的其他内容作为键的第一个元素。例如,您可以为每个时间戳添加传感器名称前缀,以便首先按传感器名称分区,然后按时间分区。假设您有许多传感器同时处于活动状态,则写入负载最终将更均匀地分布在各个分区上。现在,当您想要获取某个时间范围内多个传感器的值时,您需要对每个传感器名称执行单独的范围查询。

To avoid this problem in the sensor database, you need to use something other than the timestamp as the first element of the key. For example, you could prefix each timestamp with the sensor name so that the partitioning is first by sensor name and then by time. Assuming you have many sensors active at the same time, the write load will end up more evenly spread across the partitions. Now, when you want to fetch the values of multiple sensors within a time range, you need to perform a separate range query for each sensor name.
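这种复合键设计可以用下面的 Python 草图示意(传感器名、时间戳与读数均为虚构数据):键按(传感器名,时间戳)排序后,针对单个传感器的范围查询只需扫描一段连续的键。

A sketch of this compound-key design in Python (sensor names, timestamps, and readings are made-up data): once keys are sorted by (sensor name, timestamp), a range query for one sensor only scans a contiguous run of keys.

```python
from bisect import bisect_left, bisect_right

# 记录按 (传感器名, 时间戳) 复合键排序——先按传感器分组,组内按时间排序
records = sorted([
    ("sensor-a", "2024-01-01T00:00:00", 21.5),
    ("sensor-b", "2024-01-01T00:00:30", 19.2),
    ("sensor-a", "2024-01-01T00:01:00", 21.7),
])

def readings_for(sensor, start, end):
    """对单个传感器的时间范围查询:二分定位起止位置即可。"""
    lo = bisect_left(records, (sensor, start))
    hi = bisect_right(records, (sensor, end + "\uffff"))  # 含 end 本身
    return records[lo:hi]
```

要查询多个传感器某个时间范围内的值,就需要对每个传感器名分别调用一次 readings_for,与正文所述一致。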

按键的哈希分区

Partitioning by Hash of Key

由于存在倾斜和热点的风险,许多分布式数据存储使用哈希函数来确定给定键的分区。

Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key.

一个好的哈希函数可以接受倾斜的数据并使其均匀分布。假设您有一个接受字符串的 32 位哈希函数。每当你给它一个新字符串,它都会返回一个看似随机的、介于 0 和 2³² − 1 之间的数字。即使输入字符串非常相似,它们的哈希值也会均匀分布在该数字范围内。

A good hash function takes skewed data and makes it uniformly distributed. Say you have a 32-bit hash function that takes a string. Whenever you give it a new string, it returns a seemingly random number between 0 and 232 − 1. Even if the input strings are very similar, their hashes are evenly distributed across that range of numbers.

出于分区目的,哈希函数不需要很强的加密强度:例如,Cassandra 和 MongoDB 使用 MD5,Voldemort 使用 Fowler–Noll–Vo 函数。许多编程语言都内置了简单的哈希函数(因为哈希表要用到它们),但它们可能不适合分区:例如,在 Java 的 Object.hashCode() 和 Ruby 的 Object#hash 中,同一个键在不同的进程中可能具有不同的哈希值 [6]。

For partitioning purposes, the hash function need not be cryptographically strong: for example, Cassandra and MongoDB use MD5, and Voldemort uses the Fowler–Noll–Vo function. Many programming languages have simple hash functions built in (as they are used for hash tables), but they may not be suitable for partitioning: for example, in Java’s Object.hashCode() and Ruby’s Object#hash, the same key may have a different hash value in different processes [6].
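下面用 Python 做一个小示范(纯属示意):用 MD5 摘要的前 4 个字节构造一个稳定的 32 位哈希。它不像 Python 内置的 hash() 那样对字符串加随机种子、因进程而异,因此适合用来决定分区。

A small illustrative Python sketch: the first 4 bytes of an MD5 digest give a stable 32-bit hash, unlike Python's built-in hash(), which is randomized per process for strings and therefore unsuitable for choosing partitions.

```python
import hashlib

def stable_hash32(key: str) -> int:
    """取 MD5 摘要的前 4 个字节,得到稳定的 32 位哈希值:
    在任何进程、任何机器上,同一个键总是得到同一个结果。"""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def partition_for(key: str, num_partitions: int) -> int:
    # 示意:把 32 位哈希空间均匀切成 num_partitions 段
    return stable_hash32(key) * num_partitions // 2**32

p = partition_for("user:123", 8)
assert 0 <= p < 8
assert partition_for("user:123", 8) == p  # 确定性:同一键总落在同一分区
```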

一旦你有了合适的键散列函数,你就可以为每个分区分配一个散列范围(而不是一个键范围),并且散列落在分区范围内的每个键都将存储在该分区中。如图 6-3所示。

Once you have a suitable hash function for keys, you can assign each partition a range of hashes (rather than a range of keys), and every key whose hash falls within a partition’s range will be stored in that partition. This is illustrated in Figure 6-3.

图 6-3。按键的哈希分区。

该技术擅长在分区之间公平地分配键。分区边界可以均匀分布,也可以伪随机选择(在后一种情况下,该技术有时称为一致性哈希)。

This technique is good at distributing keys fairly among the partitions. The partition boundaries can be evenly spaced, or they can be chosen pseudorandomly (in which case the technique is sometimes known as consistent hashing).

然而不幸的是,通过使用键的哈希进行分区,我们失去了键范围分区的一个很好的属性:进行有效范围查询的能力。曾经相邻的键现在分散在所有分区中,因此它们的排序顺序丢失了。在 MongoDB 中,如果启用了基于哈希的分片模式,则任何范围查询都必须发送到所有分区 [ 4 ]。Riak [ 9 ]、Couchbase [ 10 ] 或 Voldemort 不支持主键上的范围查询。

Unfortunately however, by using the hash of the key for partitioning we lose a nice property of key-range partitioning: the ability to do efficient range queries. Keys that were once adjacent are now scattered across all the partitions, so their sort order is lost. In MongoDB, if you have enabled hash-based sharding mode, any range query has to be sent to all partitions [4]. Range queries on the primary key are not supported by Riak [9], Couchbase [10], or Voldemort.

Cassandra 在两种分区策略之间实现了折衷 [11, 12, 13]。Cassandra 中的表可以用由多个列组成的复合主键来声明。只有该键的第一部分会被哈希以确定分区,其他列则用作串联索引,用于对 Cassandra 的 SSTable 中的数据排序。因此,查询无法搜索复合键第一列的值范围,但如果为第一列指定了固定值,它就可以对该键的其他列执行高效的范围扫描。

Cassandra achieves a compromise between the two partitioning strategies [11, 12, 13]. A table in Cassandra can be declared with a compound primary key consisting of several columns. Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables. A query therefore cannot search for a range of values within the first column of a compound key, but if it specifies a fixed value for the first column, it can perform an efficient range scan over the other columns of the key.

串联索引方法为一对多关系提供了优雅的数据模型。例如,在社交媒体网站上,一个用户可能会发布许多更新。如果更新的主键选择为(user_id, update_timestamp),那么您可以有效地检索特定用户在某个时间间隔内所做的所有更新,并按时间戳排序。不同的用户可能存储在不同的分区上,但在每个用户内,更新按时间戳顺序存储在单个分区上。

The concatenated index approach enables an elegant data model for one-to-many relationships. For example, on a social media site, one user may post many updates. If the primary key for updates is chosen to be (user_id, update_timestamp), then you can efficiently retrieve all updates made by a particular user within some time interval, sorted by timestamp. Different users may be stored on different partitions, but within each user, the updates are stored ordered by timestamp on a single partition.
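这种“对键的第一部分哈希、其余部分排序”的做法可以用如下 Python 草图示意(分区数与数据均为虚构;真实系统中分区内的排序由 SSTable 维护,这里用显式排序代替):

This "hash the first part of the key, sort by the rest" approach can be sketched in Python (the partition count and data are made up; in a real system the in-partition sort order is maintained by SSTables, here an explicit sort stands in for it):

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4
partitions = defaultdict(list)  # 分区号 -> 该分区内按键排序的更新

def partition_of(user_id):
    # 只对复合主键的第一部分 (user_id) 做哈希来选分区
    h = int.from_bytes(hashlib.md5(user_id.encode()).digest()[:4], "big")
    return h % NUM_PARTITIONS

def post_update(user_id, timestamp, text):
    p = partitions[partition_of(user_id)]
    p.append((user_id, timestamp, text))
    p.sort()  # 示意:保持 (user_id, timestamp) 的排序

def updates_by(user_id, start, end):
    # 同一用户的所有更新位于同一个分区内,可按时间戳高效扫描
    p = partitions[partition_of(user_id)]
    return [(t, x) for (u, t, x) in p if u == user_id and start <= t <= end]
```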

倾斜的工作负载和缓解热点

Skewed Workloads and Relieving Hot Spots

正如所讨论的,对键进行哈希以确定其分区可以帮助减少热点。但是,它无法完全避免热点:在所有读写都针对同一个键的极端情况下,您最终仍会把所有请求路由到同一个分区。

As discussed, hashing a key to determine its partition can help reduce hot spots. However, it can’t avoid them entirely: in the extreme case where all reads and writes are for the same key, you still end up with all requests being routed to the same partition.

这种工作负载也许不常见,但并非闻所未闻:例如,在社交媒体网站上,拥有数百万粉丝的名人用户做某件事时可能会引发一场活动风暴 [14]。此事件可能导致对同一个键的大量写入(键可能是名人的用户 ID,或者人们正在评论的动作的 ID)。对键进行哈希并没有帮助,因为两个相同 ID 的哈希值仍然相同。

This kind of workload is perhaps unusual, but not unheard of: for example, on a social media site, a celebrity user with millions of followers may cause a storm of activity when they do something [14]. This event can result in a large volume of writes to the same key (where the key is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on). Hashing the key doesn’t help, as the hash of two identical IDs is still the same.

如今,大多数数据系统无法自动补偿如此高度倾斜的工作负载,因此减少倾斜是应用程序的责任。例如,如果已知某个键非常热,一种简单的技术是在该键的开头或结尾添加一个随机数。只需一个两位十进制的随机数,就能把对该键的写入均匀地分散到 100 个不同的键上,从而允许这些键被分发到不同的分区。

Today, most data systems are not able to automatically compensate for such a highly skewed workload, so it’s the responsibility of the application to reduce the skew. For example, if one key is known to be very hot, a simple technique is to add a random number to the beginning or end of the key. Just a two-digit decimal random number would split the writes to the key evenly across 100 different keys, allowing those keys to be distributed to different partitions.

然而,在将写入拆分到不同的键后,任何读取现在都必须执行额外的工作,因为它们必须从所有 100 个键读取数据并将其组合起来。该技术还需要额外的簿记:仅对少量热键附加随机数才有意义;对于绝大多数写入吞吐量较低的键来说,这将是不必要的开销。因此,您还需要某种方法来跟踪哪些键被拆分。

However, having split the writes across different keys, any reads now have to do additional work, as they have to read the data from all 100 keys and combine it. This technique also requires additional bookkeeping: it only makes sense to append the random number for the small number of hot keys; for the vast majority of keys with low write throughput this would be unnecessary overhead. Thus, you also need some way of keeping track of which keys are being split.
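下面是这一技术的一个极简 Python 草图(键名与 100 个分桶的选择均为示意):写入时给热键追加两位随机数,读取时必须合并所有 100 个子键,并需要额外登记哪些键被拆分了。

A minimal Python sketch of this technique (the key names and the choice of 100 buckets are illustrative): writes append a two-digit random number to a hot key, reads must merge all 100 sub-keys, and you need extra bookkeeping for which keys are split.

```python
import random
from collections import defaultdict

store = defaultdict(list)          # 假想的键值存储
HOT_KEYS = {"celebrity:42"}        # 额外的簿记:登记哪些键被拆分了

def write(key, value):
    if key in HOT_KEYS:
        # 追加两位随机数,把写入摊到 100 个子键(因而可落在不同分区)上
        key = f"{key}#{random.randrange(100):02d}"
    store[key].append(value)

def read(key):
    if key in HOT_KEYS:
        # 读取要做额外的工作:把 100 个子键的数据都读出来再合并
        out = []
        for i in range(100):
            out.extend(store[f"{key}#{i:02d}"])
        return out
    return store[key]              # 普通键不承担这种开销
```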

也许在未来,数据系统将能够自动检测和补偿倾斜的工作负载;但现在,您需要考虑自己的应用程序的权衡。

Perhaps in the future, data systems will be able to automatically detect and compensate for skewed workloads; but for now, you need to think through the trade-offs for your own application.

分区和二级索引

Partitioning and Secondary Indexes

到目前为止我们讨论的分区方案依赖于键值数据模型。如果记录仅通过主键访问,我们可以根据该键确定分区,并使用它将读写请求路由到负责该键的分区。

The partitioning schemes we have discussed so far rely on a key-value data model. If records are only ever accessed via their primary key, we can determine the partition from that key and use it to route read and write requests to the partition responsible for that key.

如果涉及二级索引,情况会变得更加复杂(另见“其他索引结构”)。二级索引通常不会唯一地标识一条记录,而是一种搜索特定值出现位置的方式:查找用户 123 的所有操作,查找所有包含单词 hogwash 的文章,查找所有颜色为 red 的汽车,等等。

The situation becomes more complicated if secondary indexes are involved (see also “Other Indexing Structures”). A secondary index usually doesn’t identify a record uniquely but rather is a way of searching for occurrences of a particular value: find all actions by user 123, find all articles containing the word hogwash, find all cars whose color is red, and so on.

二级索引是关系数据库的基础,在文档数据库中也很常见。许多键值存储(例如 HBase 和 Voldemort)都避免使用二级索引,因为它们增加了实现复杂性,但有些(例如 Riak)已经开始添加它们,因为它们对于数据建模非常有用。最后,二级索引是 Solr 和 Elasticsearch 等搜索服务器存在的理由。

Secondary indexes are the bread and butter of relational databases, and they are common in document databases too. Many key-value stores (such as HBase and Voldemort) have avoided secondary indexes because of their added implementation complexity, but some (such as Riak) have started adding them because they are so useful for data modeling. And finally, secondary indexes are the raison d’être of search servers such as Solr and Elasticsearch.

二级索引的问题在于它们不能整齐地映射到分区。使用二级索引对数据库进行分区的主要方法有两种:基于文档的分区和基于术语的分区。

The problem with secondary indexes is that they don’t map neatly to partitions. There are two main approaches to partitioning a database with secondary indexes: document-based partitioning and term-based partitioning.

按文档分区二级索引

Partitioning Secondary Indexes by Document

例如,假设您正在运营一个销售二手车的网站( 如图 6-4所示)。每个列表都有一个唯一的 ID(称为文档 ID),并且您可以按文档 ID 对数据库进行分区(例如,分区 0 中的 ID 0 到 499,分区 1 中的 ID 500 到 999 等)。

For example, imagine you are operating a website for selling used cars (illustrated in Figure 6-4). Each listing has a unique ID—call it the document ID—and you partition the database by the document ID (for example, IDs 0 to 499 in partition 0, IDs 500 to 999 in partition 1, etc.).

您想让用户搜索汽车,允许他们按颜色和品牌过滤,因此您需要针对 color 和 make 的二级索引(在文档数据库中这些是字段;在关系数据库中则是列)。如果您声明了索引,数据库就可以自动维护索引。ii 例如,每当一辆红色汽车被添加到数据库中,数据库分区就会自动将它添加到索引条目 color:red 的文档 ID 列表中。

You want to let users search for cars, allowing them to filter by color and by make, so you need a secondary index on color and make (in a document database these would be fields; in a relational database they would be columns). If you have declared the index, the database can perform the indexing automatically.ii For example, whenever a red car is added to the database, the database partition automatically adds it to the list of document IDs for the index entry color:red.

图 6-4。按文档对二级索引进行分区。

在这种索引方法中,每个分区是完全独立的:每个分区维护自己的二级索引,仅覆盖该分区中的文档。它不关心其他分区中存储了什么数据。每当您需要写入数据库(添加、删除或更新文档)时,您只需处理包含您正在写入的文档 ID 的分区。因此,文档分区索引也称为本地索引(与下一节中描述的全局索引相对)。

In this indexing approach, each partition is completely separate: each partition maintains its own secondary indexes, covering only the documents in that partition. It doesn’t care what data is stored in other partitions. Whenever you need to write to the database—to add, remove, or update a document—you only need to deal with the partition that contains the document ID that you are writing. For that reason, a document-partitioned index is also known as a local index (as opposed to a global index, described in the next section).

然而,读取文档分区索引时需要小心:除非您对文档 ID 做了一些特殊的操作,否则没有理由所有具有特定颜色或特定品牌的汽车都位于同一分区中。在图 6-4中,红色汽车同时出现在分区 0 和分区 1 中。因此,如果要搜索红色汽车,则需要将查询发送到所有分区,并将返回的所有结果合并起来。

However, reading from a document-partitioned index requires care: unless you have done something special with the document IDs, there is no reason why all the cars with a particular color or a particular make would be in the same partition. In Figure 6-4, red cars appear in both partition 0 and partition 1. Thus, if you want to search for red cars, you need to send the query to all partitions, and combine all the results you get back.

这种查询分区数据库的方法有时称为分散/聚集(scatter/gather),它会使二级索引上的读取查询相当昂贵。即使并行查询各个分区,分散/聚集也容易出现尾部延迟放大(请参阅“百分位数实践”)。尽管如此,它还是被广泛使用:MongoDB、Riak [15]、Cassandra [16]、Elasticsearch [17]、SolrCloud [18] 和 VoltDB [19] 全部使用文档分区的二级索引。大多数数据库供应商建议您设计分区方案,使二级索引查询可以由单个分区提供服务,但这并不总是可行,尤其是当您在单个查询中使用多个二级索引时(例如同时按颜色和品牌过滤汽车)。

This approach to querying a partitioned database is sometimes known as scatter/gather, and it can make read queries on secondary indexes quite expensive. Even if you query the partitions in parallel, scatter/gather is prone to tail latency amplification (see “Percentiles in Practice”). Nevertheless, it is widely used: MongoDB, Riak [15], Cassandra [16], Elasticsearch [17], SolrCloud [18], and VoltDB [19] all use document-partitioned secondary indexes. Most database vendors recommend that you structure your partitioning scheme so that secondary index queries can be served from a single partition, but that is not always possible, especially when you’re using multiple secondary indexes in a single query (such as filtering cars by color and by make at the same time).
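分散/聚集的读取路径可以用下面的 Python 草图示意(本地索引的内容大致参照图 6-4 的示例,文档 ID 为虚构):每个分区只维护自己的本地索引,查询必须发给所有分区,再把各自返回的结果合并。

The scatter/gather read path can be sketched in Python (the local index contents loosely follow the example of Figure 6-4; document IDs are made up): each partition maintains only its own local index, so the query goes to every partition and the results are merged.

```python
# 每个分区只为自己的文档维护 "field:value" -> [文档ID] 的本地索引
partitions = [
    {"color:red": [191, 306], "color:silver": [515]},   # 分区 0 的本地索引
    {"color:red": [768], "color:black": [893]},          # 分区 1 的本地索引
]

def find_by_color(color):
    """分散/聚集:必须查询所有分区,再合并各自返回的结果。"""
    results = []
    for local_index in partitions:   # 真实系统中这些请求会并行发出
        results.extend(local_index.get(f"color:{color}", []))
    return sorted(results)

assert find_by_color("red") == [191, 306, 768]  # 红色汽车散布在两个分区中
```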

图 6-5。按术语分区二级索引。

按术语分区二级索引

Partitioning Secondary Indexes by Term

与其让每个分区都有自己的二级索引(本地索引),我们可以构建一个覆盖所有分区中数据的全局索引。但是,我们不能只将该索引存储在一个节点上,因为它可能会成为瓶颈,违背分区的初衷。全局索引也必须进行分区,但它的分区方式可以与主键索引不同。

Rather than each partition having its own secondary index (a local index), we can construct a global index that covers data in all partitions. However, we can’t just store that index on one node, since it would likely become a bottleneck and defeat the purpose of partitioning. A global index must also be partitioned, but it can be partitioned differently from the primary key index.

图 6-5 说明了这种情况:所有分区中的红色汽车都出现在索引条目 color:red 之下,但索引本身是分区的:以字母 a 到 r 开头的颜色出现在分区 0 中,以 s 到 z 开头的颜色出现在分区 1 中。汽车品牌的索引也以类似方式分区(分区边界位于 f 和 h 之间)。

Figure 6-5 illustrates what this could look like: red cars from all partitions appear under color:red in the index, but the index is partitioned so that colors starting with the letters a to r appear in partition 0 and colors starting with s to z appear in partition 1. The index on the make of car is partitioned similarly (with the partition boundary being between f and h).

我们将这种索引称为按术语分区(term-partitioned),因为我们要查找的术语决定了索引的分区。例如,这里的一个术语是 color:red。术语(term)这个名称来自全文索引(一种特殊的二级索引),其中的术语就是文档中出现的所有单词。

We call this kind of index term-partitioned, because the term we’re looking for determines the partition of the index. Here, a term would be color:red, for example. The name term comes from full-text indexes (a particular kind of secondary index), where the terms are all the words that occur in a document.

和以前一样,我们可以按术语本身或使用术语的哈希对索引进行分区。按术语本身进行分区对于范围扫描很有用(例如,在数字属性上,例如汽车的要价),而按术语的散列进行分区则可以提供更均匀的负载分布。

As before, we can partition the index by the term itself, or using a hash of the term. Partitioning by the term itself can be useful for range scans (e.g., on a numeric property, such as the asking price of the car), whereas partitioning on a hash of the term gives a more even distribution of load.
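按术语分区的全局索引可以用下面的 Python 草图示意(这里假设一条简化的边界:术语值以 a–r 开头的进索引分区 0,以 s–z 开头的进分区 1)。注意对单个文档的一次写入可能要更新多个索引分区。

A term-partitioned global index can be sketched in Python (assuming one simplified boundary: terms whose value starts with a–r go to index partition 0, s–z to partition 1). Note how a single document write may update several index partitions.

```python
index_partitions = [dict(), dict()]  # 两个索引分区,按术语划分

def index_partition_for(term):
    # 术语形如 "color:red";按值的首字母决定索引分区(边界为假设)
    return 0 if term.split(":")[1][0] <= "r" else 1

def index_document(doc_id, fields):
    # 对单个文档的一次写入可能触及多个索引分区(乃至多个节点)
    for field, value in fields.items():
        term = f"{field}:{value}"
        part = index_partitions[index_partition_for(term)]
        part.setdefault(term, []).append(doc_id)

index_document(515, {"color": "silver", "make": "honda"})
# color:silver 以 's' 开头 -> 索引分区 1;make:honda 以 'h' 开头 -> 索引分区 0
```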

全局(术语分区)索引相对于文档分区索引的优势在于,它可以使读取更加高效:客户端只需向包含术语的分区发出请求,而不是在所有分区上进行分散/聚集它想要的。然而,全局索引的缺点是写入速度更慢且更复杂,因为对单个文档的写入现在可能会影响索引的多个分区(文档中的每个术语可能位于不同分区、不同节点上) 。

The advantage of a global (term-partitioned) index over a document-partitioned index is that it can make reads more efficient: rather than doing scatter/gather over all partitions, a client only needs to make a request to the partition containing the term that it wants. However, the downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node).

在理想的情况下,索引将始终是最新的,写入数据库的每个文档都将立即反映在索引中。但是,在术语分区索引中,这将需要跨受写入影响的所有分区进行分布式事务,但并非所有数据库都支持这一点(请参阅第7章和第 9 章)。

In an ideal world, the index would always be up to date, and every document written to the database would immediately be reflected in the index. However, in a term-partitioned index, that would require a distributed transaction across all partitions affected by a write, which is not supported in all databases (see Chapter 7 and Chapter 9).

实际上,对全局二级索引的更新通常是异步的(也就是说,如果您在写入后不久读取索引,则刚刚所做的更改可能尚未反映在索引中)。例如,Amazon DynamoDB 声明其全局二级索引在正常情况下会在不到一秒的时间内更新,但在基础设施出现故障的情况下可能会经历更长的传播延迟 [20 ]

In practice, updates to global secondary indexes are often asynchronous (that is, if you read the index shortly after a write, the change you just made may not yet be reflected in the index). For example, Amazon DynamoDB states that its global secondary indexes are updated within a fraction of a second in normal circumstances, but may experience longer propagation delays in cases of faults in the infrastructure [20].

全局术语分区索引的其他用途包括 Riak 的搜索功能 [ 21 ] 和 Oracle 数据仓库,它允许您在本地索引和全局索引之间进行选择 [ 22 ]。我们将在第 12 章回到实现术语分区二级索引的主题。

Other uses of global term-partitioned indexes include Riak’s search feature [21] and the Oracle data warehouse, which lets you choose between local and global indexing [22]. We will return to the topic of implementing term-partitioned secondary indexes in Chapter 12.

重新平衡分区

Rebalancing Partitions

随着时间的推移,数据库中的情况会发生变化:

Over time, things change in a database:

  • 查询吞吐量增加,因此您需要添加更多 CPU 来处理负载。

  • The query throughput increases, so you want to add more CPUs to handle the load.

  • 数据集大小增加,因此您需要添加更多磁盘和 RAM 来存储它。

  • The dataset size increases, so you want to add more disks and RAM to store it.

  • 一台机器出现故障,其他机器需要接管故障机器的职责。

  • A machine fails, and other machines need to take over the failed machine’s responsibilities.

所有这些变化都要求将数据和请求从一个节点移动到另一个节点。将负载从集群中的一个节点移动到另一个节点的过程称为重新平衡

All of these changes call for data and requests to be moved from one node to another. The process of moving load from one node in the cluster to another is called rebalancing.

无论使用哪种分区方案,重新平衡通常都需要满足一些最低要求:

No matter which partitioning scheme is used, rebalancing is usually expected to meet some minimum requirements:

  • 重新平衡后,负载(数据存储、读写请求)应在集群中的节点之间公平共享。

  • After rebalancing, the load (data storage, read and write requests) should be shared fairly between the nodes in the cluster.

  • 在进行重新平衡时,数据库应继续接受读取和写入。

  • While rebalancing is happening, the database should continue accepting reads and writes.

  • 节点之间不应移动过多的数据,以实现快速重新平衡并最大限度地减少网络和磁盘 I/O 负载。

  • No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.

再平衡策略

Strategies for Rebalancing

有几种不同的方法可以将分区分配给节点[ 23 ]。让我们依次简要讨论每一个。

There are a few different ways of assigning partitions to nodes [23]. Let’s briefly discuss each in turn.

如何不这样做:hash mod N

How not to do it: hash mod N

当按键的哈希分区时,我们之前说过(图 6-3),最好将可能的哈希值划分为多个范围,并将每个范围分配给一个分区(例如,如果 0 ≤ hash(key) < b₀,则将键分配给分区 0;如果 b₀ ≤ hash(key) < b₁,则分配给分区 1,等等)。

When partitioning by the hash of a key, we said earlier (Figure 6-3) that it’s best to divide the possible hashes into ranges and assign each range to a partition (e.g., assign key to partition 0 if 0 ≤ hash(key) < b0, to partition 1 if b0 ≤ hash(key) < b1, etc.).

也许您想知道为什么我们不直接使用 mod(许多编程语言中的 % 运算符)。例如,hash(key) mod 10 会返回 0 到 9 之间的数字(如果我们把哈希写成十进制数,哈希 mod 10 就是最后一位数字)。如果我们有编号 0 到 9 的 10 个节点,这似乎是把每个键分配给节点的简单方法。

Perhaps you wondered why we don’t just use mod (the % operator in many programming languages). For example, hash(key) mod 10 would return a number between 0 and 9 (if we write the hash as a decimal number, the hash mod 10 would be the last digit). If we have 10 nodes, numbered 0 to 9, that seems like an easy way of assigning each key to a node.

mod N 方法的问题在于,如果节点数量 N 发生变化,大多数键将需要从一个节点移动到另一个节点。例如,假设 hash(key) = 123456。如果最初有 10 个节点,则该键落在节点 6 上(因为 123456 mod 10 = 6)。当增长到 11 个节点时,该键需要移动到节点 3(123456 mod 11 = 3);当增长到 12 个节点时,它需要移动到节点 0(123456 mod 12 = 0)。如此频繁的移动使得重新平衡的成本过高。

The problem with the mod N approach is that if the number of nodes N changes, most of the keys will need to be moved from one node to another. For example, say hash(key) = 123456. If you initially have 10 nodes, that key starts out on node 6 (because 123456 mod 10 = 6). When you grow to 11 nodes, the key needs to move to node 3 (123456 mod 11 = 3), and when you grow to 12 nodes, it needs to move to node 0 (123456 mod 12 = 0). Such frequent moves make rebalancing excessively expensive.
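可以用几行 Python 量化这个问题(键集合是虚构的):节点数从 10 变为 11 时,绝大多数键的归属节点都会改变。

A few lines of Python quantify the problem (the key set is made up): when the node count changes from 10 to 11, the vast majority of keys change nodes.

```python
import hashlib

def h(key):
    # 稳定的 32 位哈希,仅用于演示
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if h(k) % 10 != h(k) % 11)
fraction = moved / len(keys)   # 接近 10/11 ≈ 91% 的键需要换节点
```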

我们需要一种不会过度移动数据的方法。

We need an approach that doesn’t move data around more than necessary.

固定数量的分区

Fixed number of partitions

幸运的是,有一个相当简单的解决方案:创建比节点数量多得多的分区,并为每个节点分配多个分区。例如,在 10 个节点的集群上运行的数据库可能从一开始就被分为 1,000 个分区,以便为每个节点分配大约 100 个分区。

Fortunately, there is a fairly simple solution: create many more partitions than there are nodes, and assign several partitions to each node. For example, a database running on a cluster of 10 nodes may be split into 1,000 partitions from the outset so that approximately 100 partitions are assigned to each node.

现在,如果将节点添加到集群中,新节点可以从每个现有节点窃取一些分区,直到分区再次公平分配。该过程如图6-6所示 。如果从集群中删除一个节点,则相反的情况也会发生。

Now, if a node is added to the cluster, the new node can steal a few partitions from every existing node until partitions are fairly distributed once again. This process is illustrated in Figure 6-6. If a node is removed from the cluster, the same happens in reverse.

仅整个分区在节点之间移动。分区的数量不会改变,分区的键分配也不会改变。唯一改变的是分区到节点的分配。这种分配的更改不是立即发生的 - 通过网络传输大量数据需要一些时间 - 因此旧的分区分配用于传输过程中发生的任何读取和写入。

Only entire partitions are moved between nodes. The number of partitions does not change, nor does the assignment of keys to partitions. The only thing that changes is the assignment of partitions to nodes. This change of assignment is not immediate—it takes some time to transfer a large amount of data over the network—so the old assignment of partitions is used for any reads and writes that happen while the transfer is in progress.
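下面的 Python 草图示意这种重新平衡(1,000 个分区与节点名均为虚构):加入新节点时,只把少量分区整体移交给它;键到分区的映射保持不变,改变的只是分区到节点的分配。

The Python sketch below illustrates this rebalancing (the 1,000 partitions and node names are made up): when a new node joins, only a few whole partitions are handed over to it; the key-to-partition mapping is unchanged, only the partition-to-node assignment changes.

```python
def rebalance(assignment: dict, new_node: str) -> dict:
    """assignment: 分区号 -> 节点名。返回加入 new_node 后的新分配。"""
    nodes = sorted(set(assignment.values())) + [new_node]
    target = len(assignment) // len(nodes)        # 每个节点应持有的分区数
    counts = {}
    for n in assignment.values():
        counts[n] = counts.get(n, 0) + 1
    new_assignment = dict(assignment)
    stolen = 0
    for pid, node in sorted(assignment.items()):
        if stolen >= target:
            break
        if counts[node] > target:                 # 从超额节点"偷"整个分区
            new_assignment[pid] = new_node
            counts[node] -= 1
            stolen += 1
    return new_assignment

old = {p: f"node{p % 4}" for p in range(1000)}    # 4 个节点,各 250 个分区
new = rebalance(old, "node4")
moved = sum(1 for p in old if old[p] != new[p])
assert moved == 200                               # 只移动了 1000/5 = 200 个分区
```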

图 6-6。将新节点添加到每个节点有多个分区的数据库集群。

原则上,您甚至可以考虑集群中不匹配的硬件:通过将更多分区分配给功能更强大的节点,您可以强制这些节点承担更大的负载份额。

In principle, you can even account for mismatched hardware in your cluster: by assigning more partitions to nodes that are more powerful, you can force those nodes to take a greater share of the load.

这种重新平衡方法用于 Riak [ 15 ]、Elasticsearch [ 24 ]、Couchbase [ 10 ] 和 Voldemort [ 25 ]。

This approach to rebalancing is used in Riak [15], Elasticsearch [24], Couchbase [10], and Voldemort [25].

在此配置中,分区的数量通常在数据库首次设置时是固定的,之后不会更改。虽然原则上可以拆分和合并分区(参见下一节),但固定数量的分区操作起来更简单,因此很多固定分区数据库选择不实现分区拆分。因此,一开始配置的分区数量就是您可以拥有的最大节点数量,因此您需要选择足够高的分区数量以适应未来的增长。然而,每个分区也有管理开销,因此选择过高的数字会适得其反。

In this configuration, the number of partitions is usually fixed when the database is first set up and not changed afterward. Although in principle it’s possible to split and merge partitions (see the next section), a fixed number of partitions is operationally simpler, and so many fixed-partition databases choose not to implement partition splitting. Thus, the number of partitions configured at the outset is the maximum number of nodes you can have, so you need to choose it high enough to accommodate future growth. However, each partition also has management overhead, so it’s counterproductive to choose too high a number.

如果数据集的总大小变化很大(例如,如果数据集一开始很小,但随着时间的推移可能会变得更大),那么选择正确的分区数量就很困难。由于每个分区包含总数据的固定比例,因此每个分区的大小与集群中的数据总量成比例增长。如果分区非常大,重新平衡和从节点故障中恢复会变得昂贵。但如果分区太小,则会产生太大的开销。当分区的大小“恰到好处”,既不太大也不太小时,就能实现最佳性能,如果分区数量固定但数据集大小不同,则很难实现这一点。

Choosing the right number of partitions is difficult if the total size of the dataset is highly variable (for example, if it starts small but may grow much larger over time). Since each partition contains a fixed fraction of the total data, the size of each partition grows proportionally to the total amount of data in the cluster. If partitions are very large, rebalancing and recovery from node failures become expensive. But if partitions are too small, they incur too much overhead. The best performance is achieved when the size of partitions is “just right,” neither too big nor too small, which can be hard to achieve if the number of partitions is fixed but the dataset size varies.

动态分区

Dynamic partitioning

对于使用键范围分区的数据库(请参阅“按键范围分区”),固定数量且具有固定边界的分区会非常不方便:如果边界错误,您可能最终会得到一个分区中的所有数据,并且所有其他分区都是空的。手动重新配置分区边界将非常繁琐。

For databases that use key range partitioning (see “Partitioning by Key Range”), a fixed number of partitions with fixed boundaries would be very inconvenient: if you got the boundaries wrong, you could end up with all of the data in one partition and all of the other partitions empty. Reconfiguring the partition boundaries manually would be very tedious.

因此,HBase 和 RethinkDB 等键范围分区数据库会动态创建分区。当分区增长到超过配置的大小(在 HBase 上,默认为 10 GB)时,它会被拆分为两个分区,以便大约一半的数据最终位于拆分的每一侧 [26 ]。相反,如果删除了大量数据并且分区缩小到某个阈值以下,则可以将其与相邻分区合并。 此过程类似于 B 树顶层发生的情况(请参阅“B 树”)。

For that reason, key range–partitioned databases such as HBase and RethinkDB create partitions dynamically. When a partition grows to exceed a configured size (on HBase, the default is 10 GB), it is split into two partitions so that approximately half of the data ends up on each side of the split [26]. Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition. This process is similar to what happens at the top level of a B-tree (see “B-Trees”).

每个分区分配给一个节点,每个节点可以处理多个分区,就像固定数量分区的情况一样。当一个大分区被分割后,它的两半之一可以转移到另一个节点以平衡负载。对于 HBase,分区文件的传输通过底层分布式文件系统 HDFS 进行[ 3 ]。

Each partition is assigned to one node, and each node can handle multiple partitions, like in the case of a fixed number of partitions. After a large partition has been split, one of its two halves can be transferred to another node in order to balance the load. In the case of HBase, the transfer of partition files happens through HDFS, the underlying distributed filesystem [3].
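分裂过程可以用下面的 Python 草图示意(这里用记录条数代替字节数作为阈值,纯属示意;HBase 实际按分区文件大小触发分裂,默认约 10 GB):

The split can be sketched in Python (using a record count instead of bytes as the threshold, purely for illustration; HBase actually triggers a split on partition file size, about 10 GB by default):

```python
SPLIT_THRESHOLD = 4  # 示意阈值:超过 4 条记录就分裂

def maybe_split(partition: list):
    """partition 是按键排序的记录列表;超过阈值则从中点一分为二,
    使大约一半的数据落在分裂的每一侧。"""
    if len(partition) <= SPLIT_THRESHOLD:
        return [partition]
    mid = len(partition) // 2
    return [partition[:mid], partition[mid:]]

p = [("a", 1), ("c", 2), ("f", 3), ("k", 4), ("p", 5), ("t", 6)]
left, right = maybe_split(p)
assert len(left) == 3 and len(right) == 3
assert left[-1][0] < right[0][0]   # 两半仍覆盖不相交、相邻的键范围
```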

动态分区的优点是分区的数量适应总数据量。如果数据量很小,那么少量的分区就足够了,所以开销很小;如果数据量很大,每个分区的大小都会限制在可配置的最大值[ 23 ]。

An advantage of dynamic partitioning is that the number of partitions adapts to the total data volume. If there is only a small amount of data, a small number of partitions is sufficient, so overheads are small; if there is a huge amount of data, the size of each individual partition is limited to a configurable maximum [23].

然而,需要注意的是,空数据库从单个分区开始,因为没有关于在哪里划定分区边界的先验信息。当数据集还很小时(直到达到第一次分区分裂的临界点),所有写入都必须由单个节点处理,而其他节点则处于空闲状态。为了缓解这个问题,HBase 和 MongoDB 允许在空数据库上配置一组初始分区(这称为预分割)。在键范围分区的情况下,预分割要求您已经知道键的分布将是什么样子 [ 4 , 26 ]。

However, a caveat is that an empty database starts off with a single partition, since there is no a priori information about where to draw the partition boundaries. While the dataset is small—until it hits the point at which the first partition is split—all writes have to be processed by a single node while the other nodes sit idle. To mitigate this issue, HBase and MongoDB allow an initial set of partitions to be configured on an empty database (this is called pre-splitting). In the case of key-range partitioning, pre-splitting requires that you already know what the key distribution is going to look like [4, 26].

动态分区不仅适用于键范围分区数据,而且同样适用于散列分区数据。MongoDB 从 2.4 版开始支持键范围分区和哈希分区,并且在这两种情况下都会动态分割分区。

Dynamic partitioning is not only suitable for key range–partitioned data, but can equally well be used with hash-partitioned data. MongoDB since version 2.4 supports both key-range and hash partitioning, and it splits partitions dynamically in either case.

按节点比例分区

Partitioning proportionally to nodes

通过动态分区,分区的数量与数据集的大小成正比,因为拆分和合并过程将每个分区的大小保持在某个固定的最小值和最大值之间。另一方面,对于固定数量的分区,每个分区的大小与数据集的大小成正比。在这两种情况下,分区的数量与节点的数量无关。

With dynamic partitioning, the number of partitions is proportional to the size of the dataset, since the splitting and merging processes keep the size of each partition between some fixed minimum and maximum. On the other hand, with a fixed number of partitions, the size of each partition is proportional to the size of the dataset. In both of these cases, the number of partitions is independent of the number of nodes.

Cassandra 和 Ketama 使用的第三个选项是使分区数量与节点数量成正比,换句话说,每个节点拥有固定数量的分区 [ 23、27、28 ]。在这种情况下,当节点数保持不变时,每个分区的大小与数据集大小成比例增长;而当增加节点数时,分区又会重新变小。由于更大的数据量通常需要更多的节点来存储,因此这种方法也使每个分区的大小保持相当稳定。

A third option, used by Cassandra and Ketama, is to make the number of partitions proportional to the number of nodes—in other words, to have a fixed number of partitions per node [23, 27, 28]. In this case, the size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged, but when you increase the number of nodes, the partitions become smaller again. Since a larger data volume generally requires a larger number of nodes to store, this approach also keeps the size of each partition fairly stable.

当新节点加入集群时,它会随机选择固定数量的现有分区进行分裂,然后取得每个被分裂分区其中一半的所有权,而每个分区的另一半仍保留在原处。随机化可能会产生不公平的分裂,但当对大量分区进行平均时(在 Cassandra 中,默认情况下每个节点 256 个分区),新节点最终会从现有节点中分担公平份额的负载。Cassandra 3.0 引入了另一种重新平衡算法,可以避免不公平的分裂 [ 29 ]。

When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, and then takes ownership of one half of each of those split partitions while leaving the other half of each partition in place. The randomization can produce unfair splits, but when averaged over a larger number of partitions (in Cassandra, 256 partitions per node by default), the new node ends up taking a fair share of the load from the existing nodes. Cassandra 3.0 introduced an alternative rebalancing algorithm that avoids unfair splits [29].
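下面的 Python 草图(名称与参数均为示意,并非 Cassandra 的实际代码)模拟了这一过程:新加入的节点随机选取若干令牌插入环中,从而“切开”这些令牌落入的现有分区。

The Python sketch below (names and parameters are illustrative, not Cassandra's actual code) simulates this process: a joining node picks random tokens and inserts them into the ring, thereby splitting the existing ranges those tokens fall into.

```python
import bisect
import random

PARTITIONS_PER_NODE = 8   # Cassandra's default num_tokens is 256
HASH_SPACE = 2 ** 32      # tokens are drawn from the hash function's output range

def add_node(ring, node):
    """ring: sorted list of (token, owner) pairs; each pair marks a partition boundary."""
    for _ in range(PARTITIONS_PER_NODE):
        token = random.randrange(HASH_SPACE)
        # Inserting a token splits whichever existing range it lands in;
        # the new node takes over the half that ends at this token.
        bisect.insort(ring, (token, node))

random.seed(1)  # deterministic for the example
ring = []
for n in ["node-a", "node-b", "node-c"]:
    add_node(ring, n)
```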

随机选取分区边界需要使用基于哈希的分区(因此可以从哈希函数生成的数字范围中选取边界)。事实上,这种方法最接近一致哈希的原始定义[ 7 ](参见“一致性哈希”)。较新的哈希函数可以以较低的元数据开销实现类似的效果[ 8 ]。

Picking partition boundaries randomly requires that hash-based partitioning is used (so the boundaries can be picked from the range of numbers produced by the hash function). Indeed, this approach corresponds most closely to the original definition of consistent hashing [7] (see “Consistent Hashing”). Newer hash functions can achieve a similar effect with lower metadata overhead [8].
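下面用一个最小化的一致性哈希草图(纯属示意;vnodes 数量等参数为虚构)说明“边界取自哈希函数的数值范围”这一点,并验证加入新节点时,改变归属的键只会移动到新节点上:

A minimal consistent hashing sketch (purely illustrative; parameters such as the vnode count are made up) shows how boundaries are picked from the hash function's output range, and that when a node is added, the only keys that change owner are those that move to the new node:

```python
import bisect
import hashlib

def stable_hash(value: str) -> int:
    # A stable hash is required here (Python's built-in hash() is randomized per process)
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes=8):
        # Each node owns several tokens; boundaries come from the hash output range
        self.tokens = sorted((stable_hash(f"{n}:{i}"), n)
                             for n in nodes for i in range(vnodes))

    def node_for(self, key: str) -> str:
        hashes = [t for t, _ in self.tokens]
        i = bisect.bisect_right(hashes, stable_hash(key)) % len(self.tokens)
        return self.tokens[i][1]  # first token clockwise from the key's hash

old = HashRing(["node-a", "node-b", "node-c"])
new = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"key{i}" for i in range(200)]
moved = [k for k in keys if old.node_for(k) != new.node_for(k)]
```

由于加入 node-d 只是在环上增加了它的令牌,任何改变归属的键必然落到 node-d 上,其余键保持不动,这正是一致性哈希的意义所在。

Since adding node-d only inserts its tokens into the ring, any key that changes owner must land on node-d, and all other keys stay put, which is the whole point of consistent hashing.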

操作:自动或手动重新平衡

Operations: Automatic or Manual Rebalancing

我们忽略了关于重新平衡的一个重要问题:重新平衡是自动发生还是手动发生?

There is one important question with regard to rebalancing that we have glossed over: does the rebalancing happen automatically or manually?

在全自动重新平衡(系统自动决定何时将分区从一个节点移动到另一个节点,无需任何管理员干预)和完全手动(分区到节点的分配由管理员显式配置,并且仅在管理员显式地重新配置时才会更改)之间存在一个渐变的范围。例如,Couchbase、Riak 和 Voldemort 会自动生成建议的分区分配,但需要管理员确认提交后才能生效。

There is a gradient between fully automatic rebalancing (the system decides automatically when to move partitions from one node to another, without any administrator interaction) and fully manual (the assignment of partitions to nodes is explicitly configured by an administrator, and only changes when the administrator explicitly reconfigures it). For example, Couchbase, Riak, and Voldemort generate a suggested partition assignment automatically, but require an administrator to commit it before it takes effect.

全自动重新平衡非常方便,因为正常维护所需的操作工作较少。然而,它可能是不可预测的。重新平衡是一项昂贵的操作,因为它需要重新路由请求并将大量数据从一个节点移动到另一个节点。如果不小心完成,此过程可能会使网络或节点过载,并在重新平衡过程中损害其他请求的性能。

Fully automated rebalancing can be convenient, because there is less operational work to do for normal maintenance. However, it can be unpredictable. Rebalancing is an expensive operation, because it requires rerouting requests and moving a large amount of data from one node to another. If it is not done carefully, this process can overload the network or the nodes and harm the performance of other requests while the rebalancing is in progress.

这种自动化与自动故障检测结合起来可能会很危险。例如,假设一个节点过载,并且暂时响应请求很慢。其他节点得出结论,过载的节点已死亡,并自动重新平衡集群以将负载移离该节点。这会给过载的节点、其他节点和网络带来额外的负载,使情况变得更糟,并可能导致级联故障。

Such automation can be dangerous in combination with automatic failure detection. For example, say one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that the overloaded node is dead, and automatically rebalance the cluster to move load away from it. This puts additional load on the overloaded node, other nodes, and the network—making the situation worse and potentially causing a cascading failure.

因此,有人参与重新平衡可能是一件好事。它比全自动过程慢,但有助于防止操作意外。

For that reason, it can be a good thing to have a human in the loop for rebalancing. It’s slower than a fully automatic process, but it can help prevent operational surprises.

请求路由

Request Routing

现在,我们已将数据集划分到在多台计算机上运行的多个节点上。但仍然存在一个悬而未决的问题:当客户端想要发出请求时,它如何知道要连接到哪个节点?随着分区重新平衡,分区到节点的分配也会发生变化。有人需要掌握这些变化才能回答这个问题:如果我想读取或写入键“foo”,我需要连接到哪个 IP 地址和端口号?

We have now partitioned our dataset across multiple nodes running on multiple machines. But there remains an open question: when a client wants to make a request, how does it know which node to connect to? As partitions are rebalanced, the assignment of partitions to nodes changes. Somebody needs to stay on top of those changes in order to answer the question: if I want to read or write the key “foo”, which IP address and port number do I need to connect to?

这是一个更普遍的问题的实例,称为服务发现,它并不仅限于数据库。任何可通过网络访问的软件都存在此问题,特别是当它以高可用性为目标时(在多台计算机上以冗余配置运行)。许多公司编写了自己的内部服务发现工具,其中许多已作为开源发布 [ 30 ]。

This is an instance of a more general problem called service discovery, which isn’t limited to just databases. Any piece of software that is accessible over a network has this problem, especially if it is aiming for high availability (running in a redundant configuration on multiple machines). Many companies have written their own in-house service discovery tools, and many of these have been released as open source [30].

从较高的层面来看,有几种不同的方法可以解决这个问题( 如图 6-7所示):

On a high level, there are a few different approaches to this problem (illustrated in Figure 6-7):

  1. 允许客户端联系任何节点(例如,通过循环负载平衡器)。如果该节点恰好拥有该请求所适用的分区,则它可以直接处理该请求;否则,它将请求转发到适当的节点,接收答复,并将答复传递给客户端。

  1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node coincidentally owns the partition to which the request applies, it can handle the request directly; otherwise, it forwards the request to the appropriate node, receives the reply, and passes the reply along to the client.

  2. 首先将来自客户端的所有请求发送到路由层,路由层确定应处理每个请求的节点并相应地转发它。该路由层本身不处理任何请求;它仅充当分区感知负载平衡器。

  2. Send all requests from clients to a routing tier first, which determines the node that should handle each request and forwards it accordingly. This routing tier does not itself handle any requests; it only acts as a partition-aware load balancer.

  3. 要求客户端了解分区以及分区到节点的分配。在这种情况下,客户端可以直接连接到适当的节点,无需任何中介。

  3. Require that clients be aware of the partitioning and the assignment of partitions to nodes. In this case, a client can connect directly to the appropriate node, without any intermediary.
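路由层的做法可以用几行 Python 勾勒出来(纯属示意,分区数和节点名均为虚构):路由层根据键的哈希求出分区号,再查当前的分区到节点映射:

The routing-tier approach can be sketched in a few lines of Python (purely illustrative; the partition count and node names are made up): the routing tier hashes the key to a partition number and looks it up in the current partition-to-node assignment:

```python
import hashlib

N_PARTITIONS = 8

def partition_of(key: str) -> int:
    # hash(key) mod N: a stable hash keeps routing consistent across processes
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % N_PARTITIONS

class RoutingTier:
    """A partition-aware load balancer: forwards requests, handles none itself."""
    def __init__(self, assignment):
        self.assignment = assignment  # partition -> node; updated on rebalancing

    def node_for(self, key: str) -> str:
        return self.assignment[partition_of(key)]

# Example assignment: partitions 0-3 live on node-a, partitions 4-7 on node-b
router = RoutingTier({p: ("node-a" if p < 4 else "node-b") for p in range(N_PARTITIONS)})
target = router.node_for("foo")
```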

在所有情况下,关键问题是:做出路由决策的组件(可能是节点之一、路由层或客户端)如何了解节点分区分配的变化?

In all cases, the key problem is: how does the component making the routing decision (which may be one of the nodes, or the routing tier, or the client) learn about changes in the assignment of partitions to nodes?

图 6-7。将请求路由到正确节点的三种不同方式。

Figure 6-7. Three different ways of routing a request to the right node.

这是一个具有挑战性的问题,因为让所有参与者达成一致非常重要;否则,请求会被发送到错误的节点而得不到正确处理。有一些协议可以在分布式系统中达成共识,但它们很难正确实现(参见第 9 章)。

This is a challenging problem, because it is important that all participants agree—otherwise requests would be sent to the wrong nodes and not handled correctly. There are protocols for achieving consensus in a distributed system, but they are hard to implement correctly (see Chapter 9).

许多分布式数据系统依赖单独的协调服务(例如 ZooKeeper)来跟踪该集群元数据,如图6-8所示。每个节点都在ZooKeeper中注册自己,ZooKeeper维护分区到节点的权威映射。其他参与者(例如路由层或分区感知客户端)可以在 ZooKeeper 中订阅此信息。每当分区更改所有权或添加或删除节点时,ZooKeeper 都会通知路由层,以便它可以保持其路由信息最新。

Many distributed data systems rely on a separate coordination service such as ZooKeeper to keep track of this cluster metadata, as illustrated in Figure 6-8. Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of partitions to nodes. Other actors, such as the routing tier or the partitioning-aware client, can subscribe to this information in ZooKeeper. Whenever a partition changes ownership, or a node is added or removed, ZooKeeper notifies the routing tier so that it can keep its routing information up to date.

图 6-8。使用 ZooKeeper 跟踪分区到节点的分配。

Figure 6-8. Using ZooKeeper to keep track of assignment of partitions to nodes.

例如,LinkedIn 的 Espresso 使用 Helix [ 31 ] 进行集群管理(Helix 又依赖于 ZooKeeper),实现了如图 6-8 所示的路由层。HBase、SolrCloud 和 Kafka 也使用 ZooKeeper 来跟踪分区分配。MongoDB 有类似的架构,但它依赖自己的配置服务器实现和 mongos 守护进程作为路由层。

For example, LinkedIn’s Espresso uses Helix [31] for cluster management (which in turn relies on ZooKeeper), implementing a routing tier as shown in Figure 6-8. HBase, SolrCloud, and Kafka also use ZooKeeper to track partition assignment. MongoDB has a similar architecture, but it relies on its own config server implementation and mongos daemons as the routing tier.
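这种“协调服务持有权威映射、路由层订阅变更”的模式可以用纯 Python 勾勒如下(玩具代码,仅示意 ZooKeeper 在此处扮演的角色,并非其真实 API):

The pattern, a coordination service holding the authoritative mapping while the routing tier subscribes to changes, can be sketched in plain Python as follows (toy code that only illustrates ZooKeeper's role here, not its real API):

```python
class CoordinationService:
    """Toy stand-in for ZooKeeper: holds the authoritative partition->node map
    and notifies every subscriber whenever the assignment changes."""
    def __init__(self):
        self.assignment = {}
        self.watchers = []

    def subscribe(self, callback):
        self.watchers.append(callback)
        callback(dict(self.assignment))  # deliver the current state immediately

    def assign(self, partition, node):
        self.assignment[partition] = node
        for notify in self.watchers:
            notify(dict(self.assignment))

routing_table = {}                      # the routing tier's live copy
coord = CoordinationService()
coord.subscribe(routing_table.update)   # keep it up to date automatically
coord.assign(0, "node-a")
coord.assign(1, "node-b")
coord.assign(0, "node-c")               # partition 0 changes ownership
```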

Cassandra 和 Riak 采用不同的方法:他们在节点之间使用八卦协议来传播集群状态的任何变化。请求可以发送到任何节点,该节点将它们转发到所请求分区的适当节点( 图 6-7中的方法 1 )。该模型增加了数据库节点的复杂性,但避免了对外部协调服务(例如 ZooKeeper)的依赖。

Cassandra and Riak take a different approach: they use a gossip protocol among the nodes to disseminate any changes in cluster state. Requests can be sent to any node, and that node forwards them to the appropriate node for the requested partition (approach 1 in Figure 6-7). This model puts more complexity in the database nodes but avoids the dependency on an external coordination service such as ZooKeeper.

Couchbase 不会自动重新平衡,这简化了设计。通常它配置有一个名为moxi的路由层,它从集群节点学习路由变化 [ 32 ]。

Couchbase does not rebalance automatically, which simplifies the design. Normally it is configured with a routing tier called moxi, which learns about routing changes from the cluster nodes [32].

当使用路由层或向随机节点发送请求时,客户端仍然需要找到要连接的 IP 地址。这些并不像节点的分区分配那样快速变化,因此通常使用 DNS 就足以实现此目的。

When using a routing tier or when sending requests to a random node, clients still need to find the IP addresses to connect to. These are not as fast-changing as the assignment of partitions to nodes, so it is often sufficient to use DNS for this purpose.

并行查询执行

Parallel Query Execution

到目前为止,我们关注的是读取或写入单个键的非常简单的查询(外加文档分区二级索引情况下的分散/聚集查询)。这大致就是大多数 NoSQL 分布式数据存储所支持的访问级别。

So far we have focused on very simple queries that read or write a single key (plus scatter/gather queries in the case of document-partitioned secondary indexes). This is about the level of access supported by most NoSQL distributed datastores.

然而,通常用于分析的大规模并行处理(MPP) 关系数据库产品在其支持的查询类型方面要复杂得多。典型的数据仓库查询包含多个连接、过滤、分组和聚合操作。MPP 查询优化器将这个复杂的查询分解为多个执行阶段和分区,其中许多可以在数据库集群的不同节点上并行执行。涉及扫描大部分数据集的查询特别受益于这种并行执行。

However, massively parallel processing (MPP) relational database products, often used for analytics, are much more sophisticated in the types of queries they support. A typical data warehouse query contains several join, filtering, grouping, and aggregation operations. The MPP query optimizer breaks this complex query into a number of execution stages and partitions, many of which can be executed in parallel on different nodes of the database cluster. Queries that involve scanning over large parts of the dataset particularly benefit from such parallel execution.
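这种并行执行的核心思想(各分区先在本地算出局部聚合,最后一步再合并)可以用 Python 示意如下(玩具数据;真实的 MPP 引擎会把各执行阶段调度到集群的不同节点上):

The core idea of such parallel execution, each partition computing a partial aggregate locally with a final stage combining them, can be sketched in Python as follows (toy data; a real MPP engine schedules the stages onto different nodes of the cluster):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy table, pre-partitioned: one list of rows per partition
partitions = [[3, 1, 4], [1, 5, 9, 2], [6, 5]]

def partial_aggregate(rows):
    # Stage 1 of the query plan: runs independently on each partition
    return sum(rows), len(rows)

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_aggregate, partitions))

# Final stage: combine the partial aggregates into the overall AVG
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
average = total / count
```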

数据仓库查询的快速并行执行是一个专门的主题,考虑到分析的业务重要性,它受到了很大的商业兴趣。我们将在第 10 章讨论并行查询执行的一些技术。有关并行数据库中使用的技术的更详细概述,请参阅参考文献 [ 1 , 33 ]。

Fast parallel execution of data warehouse queries is a specialized topic, and given the business importance of analytics, it receives a lot of commercial interest. We will discuss some techniques for parallel query execution in Chapter 10. For a more detailed overview of techniques used in parallel databases, please see the references [1, 33].

总结

Summary

在本章中,我们探讨了将大型数据集划分为较小子集的不同方法。当您拥有如此多的数据以至于在一台机器上存储和处理数据不再可行时,分区是必要的。

In this chapter we explored different ways of partitioning a large dataset into smaller subsets. Partitioning is necessary when you have so much data that storing and processing it on a single machine is no longer feasible.

分区的目标是将数据和查询负载均匀分布在多台机器上,避免热点(负载不成比例的高节点)。这需要选择适合您的数据的分区方案,并在向集群添加或删除节点时重新平衡分区。

The goal of partitioning is to spread the data and query load evenly across multiple machines, avoiding hot spots (nodes with disproportionately high load). This requires choosing a partitioning scheme that is appropriate to your data, and rebalancing the partitions when nodes are added to or removed from the cluster.

我们讨论了两种主要的分区方法:

We discussed two main approaches to partitioning:

  • 键范围分区,其中键被排序,并且分区拥有从最小到最大的所有键。排序的优点是可以进行高效的范围查询,但如果应用程序经常访问按排序顺序靠近的键,则存在热点风险。

    在这种方法中,当分区变得太大时,通常通过将范围分成两个子范围来动态地重新平衡分区。

  • Key range partitioning, where keys are sorted, and a partition owns all the keys from some minimum up to some maximum. Sorting has the advantage that efficient range queries are possible, but there is a risk of hot spots if the application often accesses keys that are close together in the sorted order.

    In this approach, partitions are typically rebalanced dynamically by splitting the range into two subranges when a partition gets too big.

  • 哈希分区,其中哈希函数应用于每个键,并且分区拥有一系列哈希值。此方法破坏了键的顺序,使范围查询效率低下,但可以更均匀地分配负载。

    当通过哈希进行分区时,通常会提前创建固定数量的分区,为每个节点分配多个分区,并在添加或删除节点时将整个分区从一个节点移动到另一个节点。也可以使用动态分区。

  • Hash partitioning, where a hash function is applied to each key, and a partition owns a range of hashes. This method destroys the ordering of keys, making range queries inefficient, but may distribute load more evenly.

    When partitioning by hash, it is common to create a fixed number of partitions in advance, to assign several partitions to each node, and to move entire partitions from one node to another when nodes are added or removed. Dynamic partitioning can also be used.

混合方法也是可能的,例如使用复合键:使用键的一部分来标识分区,另一部分用于排序顺序。

Hybrid approaches are also possible, for example with a compound key: using one part of the key to identify the partition and another part for the sort order.
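这种复合键方案可以用 Python 勾勒如下(示意代码;分区数与键名均为虚构):键的第一部分经哈希决定分区,分区内部再按完整键排序,从而支持对同一用户的高效范围扫描:

The compound-key scheme can be sketched in Python as follows (illustrative; the partition count and key names are made up): the first part of the key is hashed to pick the partition, and within a partition entries stay sorted by the full key, enabling efficient range scans for one user:

```python
import bisect
import hashlib

N_PARTITIONS = 4
partitions = [[] for _ in range(N_PARTITIONS)]

def partition_of(user_id: str) -> int:
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % N_PARTITIONS

def insert(user_id: str, timestamp: int, value: str):
    # Hash part picks the partition; (user_id, timestamp) gives the sort order inside it
    bisect.insort(partitions[partition_of(user_id)], ((user_id, timestamp), value))

def updates_for(user_id: str):
    # In a real store this is a range scan within one partition, not a filter
    return [v for (u, _), v in partitions[partition_of(user_id)] if u == user_id]

insert("alice", 2, "second")
insert("alice", 1, "first")
insert("bob", 1, "hello")
```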

我们还讨论了分区和二级索引之间的相互作用。二级索引也需要分区,有两种方法:

We also discussed the interaction between partitioning and secondary indexes. A secondary index also needs to be partitioned, and there are two methods:

  • 文档分区索引(本地索引),其中辅助索引与主键和值存储在同一分区中。这意味着写入时只需要更新单个分区,但读取二级索引需要跨所有分区进行分散/聚集。

  • Document-partitioned indexes (local indexes), where the secondary indexes are stored in the same partition as the primary key and value. This means that only a single partition needs to be updated on write, but a read of the secondary index requires a scatter/gather across all partitions.

  • 术语分区索引(全局索引),其中二级索引使用索引值单独分区。二级索引中的条目可以包括来自主键的所有分区的记录。当一个文档写入时,二级索引的几个分区需要更新;但是,可以从单个分区提供读取服务。

  • Term-partitioned indexes (global indexes), where the secondary indexes are partitioned separately, using the indexed values. An entry in the secondary index may include records from all partitions of the primary key. When a document is written, several partitions of the secondary index need to be updated; however, a read can be served from a single partition.
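两种读取路径的差别可以用一个玩具例子说明(数据为虚构):读取文档分区(本地)索引必须对所有分区做分散/聚集:

The difference between the two read paths can be illustrated with a toy example (made-up data): a read of a document-partitioned (local) index must scatter/gather across all partitions:

```python
# Each partition keeps its own local secondary index over its own documents
partitions = [
    {"by_color": {"red": [1], "blue": [2]}},
    {"by_color": {"red": [7]}},
    {"by_color": {"blue": [9], "red": [8]}},
]

def find_by_color(color):
    ids = []
    for part in partitions:                          # scatter: query every partition
        ids.extend(part["by_color"].get(color, []))  # gather: combine the results
    return sorted(ids)
```

相比之下,术语分区(全局)索引会把 "color:red" 这一索引项整体放在某一个分区里,读取只需访问那一个分区。

A term-partitioned (global) index, by contrast, would place the entire "color:red" entry in one partition, so the read touches only that partition.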

最后,我们讨论了将查询路由到适当分区的技术,其范围从简单的分区感知负载平衡到复杂的并行查询执行引擎。

Finally, we discussed techniques for routing queries to the appropriate partition, which range from simple partition-aware load balancing to sophisticated parallel query execution engines.

根据设计,每个分区基本上都是独立运行的,这使得分区数据库可以扩展到多台计算机。然而,需要写入多个分区的操作可能很难推理:例如,如果对一个分区的写入成功,但另一个分区失败,会发生什么情况?我们将在接下来的章节中解决这个问题。

By design, every partition operates mostly independently—that’s what allows a partitioned database to scale to multiple machines. However, operations that need to write to several partitions can be difficult to reason about: for example, what happens if the write to one partition succeeds, but another fails? We will address that question in the following chapters.

脚注

i 正如本章所讨论的,分区是一种有意将大型数据库分解为较小数据库的方法。它与网络分区(netsplits)无关,后者是节点之间网络中的一种故障。我们将在第 8 章中讨论此类故障。

i Partitioning, as discussed in this chapter, is a way of intentionally breaking a large database down into smaller ones. It has nothing to do with network partitions (netsplits), a type of fault in the network between nodes. We will discuss such faults in Chapter 8.

ii 如果您的数据库仅支持键值模型,您可能会想通过在应用程序代码中创建从值到文档 ID 的映射来自己实现二级索引。如果您走这条路,您需要非常小心地确保您的索引与底层数据保持一致。竞争条件和间歇性写入失败(其中一些更改已保存,但其他更改未保存)很容易导致数据不同步,请参阅“多对象事务的需求”。

ii If your database only supports a key-value model, you might be tempted to implement a secondary index yourself by creating a mapping from values to document IDs in application code. If you go down this route, you need to take great care to ensure your indexes remain consistent with the underlying data. Race conditions and intermittent write failures (where some changes were saved but others weren’t) can very easily cause the data to go out of sync—see “The need for multi-object transactions”.

参考

[ 1 ] David J. DeWitt 和 Jim N. Gray:“并行数据库系统:高性能数据库系统的未来”, Communications of the ACM,第 35 卷,第 6 期,第 85-98 页,1992 年 6 月 。doi:10.1145/ 129888.129894

[1] David J. DeWitt and Jim N. Gray: “Parallel Database Systems: The Future of High Performance Database Systems,” Communications of the ACM, volume 35, number 6, pages 85–98, June 1992. doi:10.1145/129888.129894

[ 2 ] Lars George:“ HBase 与 BigTable 比较”, larsgeorge.com,2009 年 11 月。

[2] Lars George: “HBase vs. BigTable Comparison,” larsgeorge.com, November 2009.

[ 3 ]“ Apache HBase 参考指南”,Apache 软件基金会,hbase.apache.org,2014 年。

[3] “The Apache HBase Reference Guide,” Apache Software Foundation, hbase.apache.org, 2014.

[ 4 ] MongoDB, Inc.:“MongoDB 2.4 中新的基于哈希的分片功能”,blog.mongodb.org,2013 年 4 月 10 日。

[4] MongoDB, Inc.: “New Hash-Based Sharding Feature in MongoDB 2.4,” blog.mongodb.org, April 10, 2013.

[ 5 ] Ikai Lan:“ App Engine 数据存储提示:单调递增的值很糟糕”,ikaisays.com,2011 年 1 月 25 日。

[5] Ikai Lan: “App Engine Datastore Tip: Monotonically Increasing Values Are Bad,” ikaisays.com, January 25, 2011.

[ 6 ] Martin Kleppmann:“ Java 的 hashCode 对于分布式系统来说并不安全”,martin.kleppmann.com,2012 年 6 月 18 日。

[6] Martin Kleppmann: “Java’s hashCode Is Not Safe for Distributed Systems,” martin.kleppmann.com, June 18, 2012.

[ 7 ] David Karger、Eric Lehman、Tom Leighton 等人:“一致性哈希和随机树:用于缓解万维网上热点的分布式缓存协议”,第 29 届 ACM 计算理论年度研讨会(STOC),第 654–663 页,1997 年 。doi:10.1145/258533.258660

[7] David Karger, Eric Lehman, Tom Leighton, et al.: “Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web,” at 29th Annual ACM Symposium on Theory of Computing (STOC), pages 654–663, 1997. doi:10.1145/258533.258660

[ 8 ] John Lamping 和 Eric Veach:“一种快速、最小内存、一致的哈希算法”,arxiv.org,2014 年 6 月。

[8] John Lamping and Eric Veach: “A Fast, Minimal Memory, Consistent Hash Algorithm,” arxiv.org, June 2014.

[ 9 ] Eric Redmond:“ Riak 小书”,版本 1.4.0,Basho Technologies,2013 年 9 月。

[9] Eric Redmond: “A Little Riak Book,” Version 1.4.0, Basho Technologies, September 2013.

[ 10 ]“ Couchbase 2.5 管理员指南”,Couchbase, Inc.,2014 年。

[10] “Couchbase 2.5 Administrator Guide,” Couchbase, Inc., 2014.

[ 11 ] Avinash Lakshman 和 Prashant Malik:“ Cassandra – 分散式结构化存储系统”,第三届 ACM SIGOPS 国际大型分布式系统和中间件研讨会(LADIS),2009 年 10 月。

[11] Avinash Lakshman and Prashant Malik: “Cassandra – A Decentralized Structured Storage System,” at 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS), October 2009.

[ 12 ] Jonathan Ellis:“ Facebook 的 Cassandra 论文,带注释并与 Apache Cassandra 2.0 进行比较” , datastax.com,2013 年 9 月 12 日。

[12] Jonathan Ellis: “Facebook’s Cassandra Paper, Annotated and Compared to Apache Cassandra 2.0,” datastax.com, September 12, 2013.

[ 13 ]“ Cassandra 查询语言简介”,DataStax, Inc.,2014 年。

[13] “Introduction to Cassandra Query Language,” DataStax, Inc., 2014.

[ 14 ] Samuel Axon:“ 3% 的 Twitter 服务器专用于 Justin Bieber ”,mashable.com,2010 年 9 月 7 日。

[14] Samuel Axon: “3% of Twitter’s Servers Dedicated to Justin Bieber,” mashable.com, September 7, 2010.

[ 15 ]“ Riak 1.4.8 文档”,Basho Technologies, Inc.,2014 年。

[15] “Riak 1.4.8 Docs,” Basho Technologies, Inc., 2014.

[ 16 ] Richard Low:“ Cassandra 二级索引的最佳点”,wentnet.com,2013 年 10 月 21 日。

[16] Richard Low: “The Sweet Spot for Cassandra Secondary Indexing,” wentnet.com, October 21, 2013.

[ 17 ] Zachary Tong:“自定义您的文档路由”,elasticsearch.org,2013 年 6 月 3 日。

[17] Zachary Tong: “Customizing Your Document Routing,” elasticsearch.org, June 3, 2013.

[ 18 ]“ Apache Solr 参考指南”,Apache 软件基金会,2014 年。

[18] “Apache Solr Reference Guide,” Apache Software Foundation, 2014.

[ 19 ] Andrew Pavlo:“ H-Store 常见问题”, hstore.cs.brown.edu,2013 年 10 月。

[19] Andrew Pavlo: “H-Store Frequently Asked Questions,” hstore.cs.brown.edu, October 2013.

[ 20 ]“ Amazon DynamoDB 开发人员指南”,Amazon Web Services, Inc.,2014 年。

[20] “Amazon DynamoDB Developer Guide,” Amazon Web Services, Inc., 2014.

[ 21 ] Rusty Klophaus:“ 2I 和搜索之间的差异”,发送给riak-users邮件列表的电子邮件,lists.basho.com,2011 年 10 月 25 日。

[21] Rusty Klophaus: “Difference Between 2I and Search,” email to riak-users mailing list, lists.basho.com, October 25, 2011.

[ 22 ] Donald K. Burleson:“ Oracle 中的对象分区”, dba-oracle.com,2000 年 11 月 8 日。

[22] Donald K. Burleson: “Object Partitioning in Oracle,” dba-oracle.com, November 8, 2000.

[ 23 ] Eric Evans:“重新思考 Cassandra 中的拓扑”,ApacheCon Europe,2012 年 11 月。

[23] Eric Evans: “Rethinking Topology in Cassandra,” at ApacheCon Europe, November 2012.

[ 24 ] Rafał Kuć:“重新路由 API 解释”, elasticsearchserverbook.com,2013 年 9 月 30 日。

[24] Rafał Kuć: “Reroute API Explained,” elasticsearchserverbook.com, September 30, 2013.

[ 25 ]“ Project Voldemort 项目文档”,project-voldemort.com。

[25] “Project Voldemort Documentation,” project-voldemort.com.

[ 26 ] Enis Soztutar:“ Apache HBase 区域拆分和合并”,hortonworks.com,2013 年 2 月 1 日。

[26] Enis Soztutar: “Apache HBase Region Splitting and Merging,” hortonworks.com, February 1, 2013.

[ 27 ] Brandon Williams:“ Cassandra 1.2 中的虚拟节点”,datastax.com,2012 年 12 月 4 日。

[27] Brandon Williams: “Virtual Nodes in Cassandra 1.2,” datastax.com, December 4, 2012.

[ 28 ] Richard Jones:“ libketama:Memcached 客户端的一致性哈希库”,metabrew.com,2007 年 4 月 10 日。

[28] Richard Jones: “libketama: Consistent Hashing Library for Memcached Clients,” metabrew.com, April 10, 2007.

[ 29 ] Branimir Lambov:“ Cassandra 3.0 中的新令牌分配算法”,datastax.com,2016 年 1 月 28 日。

[29] Branimir Lambov: “New Token Allocation Algorithm in Cassandra 3.0,” datastax.com, January 28, 2016.

[ 30 ] Jason Wilder:“开源服务发现”,jasonwilder.com,2014 年 2 月。

[30] Jason Wilder: “Open-Source Service Discovery,” jasonwilder.com, February 2014.

[ 31 ] Kishore Gopalakrishna、Shi Lu、Zhen Zhang 等人:“ Untangling Cluster Management with Helix ”,ACM 云计算研讨会(SoCC),2012 年 10 月 。doi:10.1145/2391229.2391248

[31] Kishore Gopalakrishna, Shi Lu, Zhen Zhang, et al.: “Untangling Cluster Management with Helix,” at ACM Symposium on Cloud Computing (SoCC), October 2012. doi:10.1145/2391229.2391248

[ 32 ]“ Moxi 1.8 手册”,Couchbase, Inc.,2014 年。

[32] “Moxi 1.8 Manual,” Couchbase, Inc., 2014.

[ 33 ] Shivnath Babu 和 Herodotos Herodotou:“大规模并行数据库和 MapReduce 系统”,数据库基础与趋势,第 5 卷,第 1 期,第 1-104 页,2013 年 11 月 。doi:10.1561/1900000036

[33] Shivnath Babu and Herodotos Herodotou: “Massively Parallel Databases and MapReduce Systems,” Foundations and Trends in Databases, volume 5, number 1, pages 1–104, November 2013. doi:10.1561/1900000036

第 7 章 事务

Chapter 7. Transactions

一些作者声称,一般的两阶段提交成本太高而无法支持,因为它带来了性能或可用性问题。我们认为,最好让应用程序程序员在瓶颈出现时处理由于过度使用事务而导致的性能问题,而不是总是围绕事务的缺乏进行编码。

James Corbett 等人,Spanner:Google 的全球分布式数据库(2012)

Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.

James Corbett et al., Spanner: Google’s Globally-Distributed Database (2012)

在数据系统的严酷现实中,很多事情都可能出错:

In the harsh reality of data systems, many things can go wrong:

  • 数据库软件或硬件可能随时发生故障(包括在写操作过程中)。

  • The database software or hardware may fail at any time (including in the middle of a write operation).

  • 应用程序可能随时崩溃(包括一系列操作的中途)。

  • The application may crash at any time (including halfway through a series of operations).

  • 网络中断可能会意外地将应用程序与数据库断开,或者将一个数据库节点与另一个数据库节点断开。

  • Interruptions in the network can unexpectedly cut off the application from the database, or one database node from another.

  • 多个客户端可能会同时写入数据库,覆盖彼此的更改。

  • Several clients may write to the database at the same time, overwriting each other’s changes.

  • 客户端可能会读取没有意义的数据,因为它仅部分更新。

  • A client may read data that doesn’t make sense because it has only partially been updated.

  • 客户端之间的竞争条件可能会导致令人惊讶的错误。

  • Race conditions between clients can cause surprising bugs.

为了可靠,系统必须处理这些故障并确保它们不会导致整个系统发生灾难性故障。然而,实现容错机制需要大量工作。它需要对所有可能出错的事情进行大量仔细的思考,并进行大量测试以确保解决方案确实有效。

In order to be reliable, a system has to deal with these faults and ensure that they don’t cause catastrophic failure of the entire system. However, implementing fault-tolerance mechanisms is a lot of work. It requires a lot of careful thinking about all the things that can go wrong, and a lot of testing to ensure that the solution actually works.

几十年来,事务一直是简化这些问题的首选机制。事务是应用程序将多个读取和写入组合成一个逻辑单元的一种方式。从概念上讲,事务中的所有读取和写入都作为一个操作执行:要么整个事务成功(提交),要么失败(中止、回滚)。如果失败,应用程序可以安全地重试。有了事务,应用程序的错误处理就变得简单得多,因为它不需要担心部分失败,即某些操作成功而某些操作失败(无论出于何种原因)的情况。

For decades, transactions have been the mechanism of choice for simplifying these issues. A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback). If it fails, the application can safely retry. With transactions, error handling becomes much simpler for an application, because it doesn’t need to worry about partial failure—i.e., the case where some operations succeed and some fail (for whatever reason).
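提交/中止的语义可以用 Python 标准库中的 SQLite 来演示(示意例子,表结构为虚构):事务中途失败时,已执行的写入会被整体回滚,不会留下部分失败:

Commit/abort semantics can be demonstrated with SQLite from Python's standard library (an illustrative example with a made-up schema): if a transaction fails partway through, the writes already executed are rolled back as a whole, leaving no partial failure behind:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(amount):
    with conn:  # one transaction: commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                     (amount,))
        if amount > 100:
            raise ValueError("insufficient funds")  # abort: the debit above is undone
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                     (amount,))

transfer(40)
try:
    transfer(500)   # fails after the debit has already executed...
except ValueError:
    pass            # ...but the rollback means the application can simply retry

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```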

如果您花了多年时间与事务打交道,它们可能看起来显而易见,但我们不应该认为它们是理所当然的。事务不是自然法则;创建它们是有目的的,即简化访问数据库的应用程序的编程模型。通过使用事务,应用程序可以自由地忽略某些潜在的错误场景和并发问题,因为数据库会代为处理(我们称之为安全保证)。

If you have spent years working with transactions, they may seem obvious, but we shouldn’t take them for granted. Transactions are not a law of nature; they were created with a purpose, namely to simplify the programming model for applications accessing a database. By using transactions, the application is free to ignore certain potential error scenarios and concurrency issues, because the database takes care of them instead (we call these safety guarantees).

并非每个应用程序都需要事务,有时削弱事务保证或完全放弃事务也有好处(例如,为了获得更高的性能或更高的可用性)。一些安全属性也可以在没有事务的情况下实现。

Not every application needs transactions, and sometimes there are advantages to weakening transactional guarantees or abandoning them entirely (for example, to achieve higher performance or higher availability). Some safety properties can be achieved without transactions.

如何判断您是否需要事务?为了回答这个问题,我们首先需要准确了解事务可以提供哪些安全保证,以及与之相关的成本。尽管事务乍一看似乎很简单,但实际上有许多微妙而重要的细节在起作用。

How do you figure out whether you need transactions? In order to answer that question, we first need to understand exactly what safety guarantees transactions can provide, and what costs are associated with them. Although transactions seem straightforward at first glance, there are actually many subtle but important details that come into play.

在本章中,我们将研究许多可能出错的示例,并探讨数据库用于防范这些问题的算法。我们将特别深入探讨并发控制领域,讨论可能发生的各种竞争条件以及数据库如何实现隔离级别,例如已提交读快照隔离可串行性

In this chapter, we will examine many examples of things that can go wrong, and explore the algorithms that databases use to guard against those issues. We will go especially deep in the area of concurrency control, discussing various kinds of race conditions that can occur and how databases implement isolation levels such as read committed, snapshot isolation, and serializability.

本章适用于单节点和分布式数据库;在第 8 章中,我们将重点讨论仅在分布式系统中出现的特定挑战。

This chapter applies to both single-node and distributed databases; in Chapter 8 we will focus the discussion on the particular challenges that arise only in distributed systems.

事务的棘手概念

The Slippery Concept of a Transaction

当今几乎所有关系数据库和一些非关系数据库都支持事务。其中大多数遵循1975 年 IBM System R(第一个 SQL数据库) 引入的风格 [ 1,2,3 ]。尽管一些实现细节发生了变化,但 40 年来总体思路几乎保持不变:MySQL、PostgreSQL、Oracle、SQL Server 等中的事务支持与 System R 惊人地相似。

Almost all relational databases today, and some nonrelational databases, support transactions. Most of them follow the style that was introduced in 1975 by IBM System R, the first SQL database [1, 2, 3]. Although some implementation details have changed, the general idea has remained virtually the same for 40 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily similar to that of System R.

2000 年代末,非关系型 (NoSQL) 数据库开始流行。他们的目标是通过提供新数据模型的选择(参见 第 2 章)以及默认包含复制(第 5 章)和分区(第 6 章)来改善关系现状。事务是这场运动的主要受害者:许多新一代数据库完全放弃了事务,或者重新定义了这个词来描述一组比以前理解的弱得多的保证[4 ]

In the late 2000s, nonrelational (NoSQL) databases started gaining popularity. They aimed to improve upon the relational status quo by offering a choice of new data models (see Chapter 2), and by including replication (Chapter 5) and partitioning (Chapter 6) by default. Transactions were the main casualty of this movement: many of this new generation of databases abandoned transactions entirely, or redefined the word to describe a much weaker set of guarantees than had previously been understood [4].

随着这种新型分布式数据库的大肆宣传,人们普遍认为事务是可扩展性的对立面,任何大型系统都必须放弃事务才能保持良好的性能和高可用性 [ 5 , 6 ] 。另一方面,数据库供应商有时将事务保证作为具有“有价值数据”的“严肃应用程序”的基本要求。这两种观点都纯粹是夸张的。

With the hype around this new crop of distributed databases, there emerged a popular belief that transactions were the antithesis of scalability, and that any large-scale system would have to abandon transactions in order to maintain good performance and high availability [5, 6]. On the other hand, transactional guarantees are sometimes presented by database vendors as an essential requirement for “serious applications” with “valuable data.” Both viewpoints are pure hyperbole.

事实并非那么简单:与其他所有技术设计选择一样,事务也有优点和局限性。为了理解这些权衡,让我们详细了解事务所能提供的保证,既包括正常操作中的情况,也包括各种极端(但现实)情况下的情况。

The truth is not that simple: like every other technical design choice, transactions have advantages and limitations. In order to understand those trade-offs, let’s go into the details of the guarantees that transactions can provide—both in normal operation and in various extreme (but realistic) circumstances.

The Meaning of ACID

The safety guarantees provided by transactions are often described by the well-known acronym ACID, which stands for Atomicity, Consistency, Isolation, and Durability. It was coined in 1983 by Theo Härder and Andreas Reuter [7] in an effort to establish precise terminology for fault-tolerance mechanisms in databases.

However, in practice, one database’s implementation of ACID does not equal another’s implementation. For example, as we shall see, there is a lot of ambiguity around the meaning of isolation [8]. The high-level idea is sound, but the devil is in the details. Today, when a system claims to be “ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term.

(Systems that do not meet the ACID criteria are sometimes called BASE, which stands for Basically Available, Soft state, and Eventual consistency [9]. This is even more vague than the definition of ACID. It seems that the only sensible definition of BASE is “not ACID”; i.e., it can mean almost anything you want.)

Let’s dig into the definitions of atomicity, consistency, isolation, and durability, as this will let us refine our idea of transactions.

Atomicity

In general, atomic refers to something that cannot be broken down into smaller parts. The word means similar but subtly different things in different branches of computing. For example, in multi-threaded programming, if one thread executes an atomic operation, that means there is no way that another thread could see the half-finished result of the operation. The system can only be in the state it was before the operation or after the operation, not something in between.

By contrast, in the context of ACID, atomicity is not about concurrency. It does not describe what happens if several processes try to access the same data at the same time, because that is covered under the letter I, for isolation (see “Isolation”).

Rather, ACID atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes have been processed—for example, a process crashes, a network connection is interrupted, a disk becomes full, or some integrity constraint is violated. If the writes are grouped together into an atomic transaction, and the transaction cannot be completed (committed) due to a fault, then the transaction is aborted and the database must discard or undo any writes it has made so far in that transaction.
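A minimal sketch of this all-or-nothing behavior, using Python's `sqlite3` module. The `accounts` table, the names, and the "insufficient funds" check are purely illustrative (they are not from this book); the point is that a fault partway through causes the whole group of writes to be rolled back:

```python
import sqlite3

# Illustrative schema: a transfer touches two rows and must be all-or-nothing.
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit; we issue BEGIN ourselves
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")

def transfer(conn, src, dst, amount):
    """Group both writes into one atomic transaction: if anything fails
    partway through, roll back so that neither write takes effect."""
    try:
        conn.execute("BEGIN")
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
        if balance < 0:  # simulated integrity constraint violation
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")  # discard all writes made so far in this transaction
        return False
```

A failed call leaves both balances exactly as they were, so the application can safely retry.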

Without atomicity, if an error occurs partway through making multiple changes, it’s difficult to know which changes have taken effect and which haven’t. The application could try again, but that risks making the same change twice, leading to duplicate or incorrect data. Atomicity simplifies this problem: if a transaction was aborted, the application can be sure that it didn’t change anything, so it can safely be retried.

The ability to abort a transaction on error and have all writes from that transaction discarded is the defining feature of ACID atomicity. Perhaps abortability would have been a better term than atomicity, but we will stick with atomicity since that’s the usual word.

Consistency

The word consistency is terribly overloaded:

  • 第 5 章中,我们讨论了副本一致性以及 异步复制系统中出现的最终一致性问题(请参阅“复制滞后问题”)。

  • In Chapter 5 we discussed replica consistency and the issue of eventual consistency that arises in asynchronously replicated systems (see “Problems with Replication Lag”).

  • Consistent hashing is an approach to partitioning that some systems use for rebalancing (see “Consistent Hashing”).

  • In the CAP theorem (see Chapter 9), the word consistency is used to mean linearizability (see “Linearizability”).

  • In the context of ACID, consistency refers to an application-specific notion of the database being in a “good state.”

It’s unfortunate that the same word is used with at least four different meanings.

The idea of ACID consistency is that you have certain statements about your data (invariants) that must always be true—for example, in an accounting system, credits and debits across all accounts must always be balanced. If a transaction starts with a database that is valid according to these invariants, and any writes during the transaction preserve the validity, then you can be sure that the invariants are always satisfied.

However, this idea of consistency depends on the application’s notion of invariants, and it’s the application’s responsibility to define its transactions correctly so that they preserve consistency. This is not something that the database can guarantee: if you write bad data that violates your invariants, the database can’t stop you. (Some specific kinds of invariants can be checked by the database, for example using foreign key constraints or uniqueness constraints. However, in general, the application defines what data is valid or invalid—the database only stores it.)

Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database’s atomicity and isolation properties in order to achieve consistency, but it’s not up to the database alone. Thus, the letter C doesn’t really belong in ACID.

Isolation

Most databases are accessed by several clients at the same time. That is no problem if they are reading and writing different parts of the database, but if they are accessing the same database records, you can run into concurrency problems (race conditions).

Figure 7-1 is a simple example of this kind of problem. Say you have two clients simultaneously incrementing a counter that is stored in a database. Each client needs to read the current value, add 1, and write the new value back (assuming there is no increment operation built into the database). In Figure 7-1 the counter should have increased from 42 to 44, because two increments happened, but it actually only went to 43 because of the race condition.
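The lost increment can be reproduced deterministically by replaying the interleaving from Figure 7-1 by hand; in this sketch a plain Python dict stands in for the database, and the key name is illustrative:

```python
# A plain dict stands in for the database; 42 mirrors Figure 7-1.
db = {"counter": 42}

def interleaved_increments(db, key):
    """Both clients read the current value before either writes back,
    so one of the two increments is silently lost."""
    seen_by_client1 = db[key]        # client 1 reads 42
    seen_by_client2 = db[key]        # client 2 also reads 42
    db[key] = seen_by_client1 + 1    # client 1 writes back 43
    db[key] = seen_by_client2 + 1    # client 2 also writes back 43
    return db[key]
```

Two increments happened, yet the counter ends at 43 rather than 44: that is the race condition.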

Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other: they cannot step on each other’s toes. The classic database textbooks formalize isolation as serializability, which means that each transaction can pretend that it is the only transaction running on the entire database. The database ensures that when the transactions have committed, the result is the same as if they had run serially (one after another), even though in reality they may have run concurrently [10].

Figure 7-1. A race condition between two clients concurrently incrementing a counter.

However, in practice, serializable isolation is rarely used, because it carries a performance penalty. Some popular databases, such as Oracle 11g, don’t even implement it. In Oracle there is an isolation level called “serializable,” but it actually implements something called snapshot isolation, which is a weaker guarantee than serializability [8, 11]. We will explore snapshot isolation and other forms of isolation in “Weak Isolation Levels”.

Durability

The purpose of a database system is to provide a safe place where data can be stored without fear of losing it. Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.

In a single-node database, durability typically means that the data has been written to nonvolatile storage such as a hard drive or SSD. It usually also involves a write-ahead log or similar (see “Making B-trees reliable”), which allows recovery in the event that the data structures on disk are corrupted. In a replicated database, durability may mean that the data has been successfully copied to some number of nodes. In order to provide a durability guarantee, a database must wait until these writes or replications are complete before reporting a transaction as successfully committed.
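The core of the write-ahead-log idea can be sketched in a few lines: append a record and force it to stable storage before reporting success. This is a drastic simplification (real logs add checksums, torn-write handling, and group commit), and it assumes a POSIX-style `fsync`:

```python
import os
import tempfile

def durable_append(path, record):
    """Append one record and force it to stable storage before returning -
    a drastically simplified write-ahead log."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")
        f.flush()             # application buffer -> OS page cache
        os.fsync(f.fileno())  # OS page cache -> disk (the durability step)

def read_log(path):
    """Replay the log, e.g. during crash recovery."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
durable_append(log_path, "txn 1: set x = 3")
durable_append(log_path, "txn 1: commit")
```

Only once `fsync` has returned may the database tell the client that the transaction committed.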

As discussed in “Reliability”, perfect durability does not exist: if all your hard disks and all your backups are destroyed at the same time, there’s obviously nothing your database can do to save you.

Single-Object and Multi-Object Operations

To recap, in ACID, atomicity and isolation describe what the database should do if a client makes several writes within the same transaction:

原子性
Atomicity

If an error occurs halfway through a sequence of writes, the transaction should be aborted, and the writes made up to that point should be discarded. In other words, the database saves you from having to worry about partial failure, by giving an all-or-nothing guarantee.

Isolation

Concurrently running transactions shouldn’t interfere with each other. For example, if one transaction makes several writes, then another transaction should see either all or none of those writes, but not some subset.

These definitions assume that you want to modify several objects (rows, documents, records) at once. Such multi-object transactions are often needed if several pieces of data need to be kept in sync. Figure 7-2 shows an example from an email application. To display the number of unread messages for a user, you could query something like:

SELECT COUNT(*) FROM emails WHERE recipient_id = 2 AND unread_flag = true

However, you might find this query to be too slow if there are many emails, and decide to store the number of unread messages in a separate field (a kind of denormalization). Now, whenever a new message comes in, you have to increment the unread counter as well, and whenever a message is marked as read, you also have to decrement the unread counter.

图 7-2中,用户 2 遇到异常:邮箱列表显示一条未读邮件,但计数器显示零个未读邮件,因为计数器增量尚未发生。ii 隔离可以通过确保用户 2 看到插入的电子邮件和更新的计数器,或者两者都看不到,但不会出现不一致的中间点来防止此问题。

In Figure 7-2, user 2 experiences an anomaly: the mailbox listing shows an unread message, but the counter shows zero unread messages because the counter increment has not yet happened. Isolation would have prevented this issue by ensuring that user 2 sees either both the inserted email and the updated counter, or neither, but not an inconsistent halfway point.

Figure 7-2. Violating isolation: one transaction reads another transaction’s uncommitted writes (a “dirty read”).

Figure 7-3 illustrates the need for atomicity: if an error occurs somewhere over the course of the transaction, the contents of the mailbox and the unread counter might become out of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted and the inserted email is rolled back.

Figure 7-3. Atomicity ensures that if an error occurs, any prior writes from that transaction are undone, to avoid an inconsistent state.
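The email scenario of Figures 7-2 and 7-3 can be sketched with `sqlite3`, wrapping the insert and the counter update in one transaction. The table names and the simulated fault are illustrative, not from the book:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit; explicit BEGIN below
conn.execute("CREATE TABLE emails (recipient_id INTEGER, body TEXT, unread_flag INTEGER)")
conn.execute("CREATE TABLE counters (recipient_id INTEGER PRIMARY KEY, unread INTEGER)")
conn.execute("INSERT INTO counters VALUES (2, 0)")

def deliver_email(conn, recipient_id, body, fail_before_counter=False):
    """Insert the email and bump the denormalized unread counter in one
    transaction, so readers never observe one write without the other."""
    try:
        conn.execute("BEGIN")
        conn.execute("INSERT INTO emails VALUES (?, ?, 1)", (recipient_id, body))
        if fail_before_counter:
            raise RuntimeError("simulated fault between the two writes")
        conn.execute("UPDATE counters SET unread = unread + 1 WHERE recipient_id = ?",
                     (recipient_id,))
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")  # the inserted email is rolled back too
        return False
```

If the counter update fails, the rollback also removes the inserted email, so mailbox and counter stay in sync.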

Multi-object transactions require some way of determining which read and write operations belong to the same transaction. In relational databases, that is typically done based on the client’s TCP connection to the database server: on any particular connection, everything between a BEGIN TRANSACTION and a COMMIT statement is considered to be part of the same transaction.

On the other hand, many nonrelational databases don’t have such a way of grouping operations together. Even if there is a multi-object API (for example, a key-value store may have a multi-put operation that updates several keys in one operation), that doesn’t necessarily mean it has transaction semantics: the command may succeed for some keys and fail for others, leaving the database in a partially updated state.
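A toy illustration of a multi-put without transaction semantics — the store applies writes one key at a time and stops at the first error, leaving earlier writes in place (class and key names are hypothetical):

```python
class FlakyKVStore:
    """A toy key-value store whose multi_put is *not* transactional:
    writes are applied one key at a time, and a failure partway
    through leaves the earlier writes in place."""
    def __init__(self):
        self.data = {}
        self.fail_on = set()  # keys whose writes we simulate as failing

    def multi_put(self, items):
        for key, value in items.items():
            if key in self.fail_on:
                raise IOError(f"write failed for key {key!r}")
            self.data[key] = value

store = FlakyKVStore()
store.fail_on = {"b"}
try:
    store.multi_put({"a": 1, "b": 2, "c": 3})  # "a" succeeds, "b" fails, "c" never runs
except IOError:
    pass  # the database is now in a partially updated state
```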

Single-object writes

Atomicity and isolation also apply when a single object is being changed. For example, imagine you are writing a 20 KB JSON document to a database:

  • If the network connection is interrupted after the first 10 KB have been sent, does the database store that unparseable 10 KB fragment of JSON?

  • If the power fails while the database is in the middle of overwriting the previous value on disk, do you end up with the old and new values spliced together?

  • If another client reads that document while the write is in progress, will it see a partially updated value?

Those issues would be incredibly confusing, so storage engines almost universally aim to provide atomicity and isolation on the level of a single object (such as a key-value pair) on one node. Atomicity can be implemented using a log for crash recovery (see “Making B-trees reliable”), and isolation can be implemented using a lock on each object (allowing only one thread to access an object at any one time).

Some databases also provide more complex atomic operations, such as an increment operation, which removes the need for a read-modify-write cycle like that in Figure 7-1. Similarly popular is a compare-and-set operation, which allows a write to happen only if the value has not been concurrently changed by someone else (see “Compare-and-set”).
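Compare-and-set can be sketched as a conditional update: write the new value only if the current value still matches what the client last read. This version uses a `sqlite3` table as an illustrative stand-in for a key-value store (real stores expose CAS through their own APIs):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO kv VALUES ('counter', 42)")

def compare_and_set(conn, key, expected, new):
    """Succeed only if the value is still what the client last read;
    rowcount == 0 means someone else changed it concurrently."""
    cursor = conn.execute(
        "UPDATE kv SET value = ? WHERE key = ? AND value = ?",
        (new, key, expected))
    return cursor.rowcount == 1
```

A client that loses the race gets `False` back and can re-read the value and try again, which makes the increment of Figure 7-1 safe.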

These single-object operations are useful, as they can prevent lost updates when several clients try to write to the same object concurrently (see “Preventing Lost Updates”). However, they are not transactions in the usual sense of the word. Compare-and-set and other single-object operations have been dubbed “lightweight transactions” or even “ACID” for marketing purposes [20, 21, 22], but that terminology is misleading. A transaction is usually understood as a mechanism for grouping multiple operations on multiple objects into one unit of execution.

The need for multi-object transactions

Many distributed datastores have abandoned multi-object transactions because they are difficult to implement across partitions, and they can get in the way in some scenarios where very high availability or performance is required. However, there is nothing that fundamentally prevents transactions in a distributed database, and we will discuss implementations of distributed transactions in Chapter 9.

But do we need multi-object transactions at all? Would it be possible to implement any application with only a key-value data model and single-object operations?

There are some use cases in which single-object inserts, updates, and deletes are sufficient. However, in many other cases writes to several different objects need to be coordinated:

  • In a relational data model, a row in one table often has a foreign key reference to a row in another table. (Similarly, in a graph-like data model, a vertex has edges to other vertices.) Multi-object transactions allow you to ensure that these references remain valid: when inserting several records that refer to one another, the foreign keys have to be correct and up to date, or the data becomes nonsensical.

  • In a document data model, the fields that need to be updated together are often within the same document, which is treated as a single object—no multi-object transactions are needed when updating a single document. However, document databases lacking join functionality also encourage denormalization (see “Relational Versus Document Databases Today”). When denormalized information needs to be updated, like in the example of Figure 7-2, you need to update several documents in one go. Transactions are very useful in this situation to prevent denormalized data from going out of sync.

  • In databases with secondary indexes (almost everything except pure key-value stores), the indexes also need to be updated every time you change a value. These indexes are different database objects from a transaction point of view: for example, without transaction isolation, it’s possible for a record to appear in one index but not another, because the update to the second index hasn’t happened yet.

Such applications can still be implemented without transactions. However, error handling becomes much more complicated without atomicity, and the lack of isolation can cause concurrency problems. We will discuss those in “Weak Isolation Levels”, and explore alternative approaches in Chapter 12.

Handling errors and aborts

A key feature of a transaction is that it can be aborted and safely retried if an error occurred. ACID databases are based on this philosophy: if the database is in danger of violating its guarantee of atomicity, isolation, or durability, it would rather abandon the transaction entirely than allow it to remain half-finished.

Not all systems follow that philosophy, though. In particular, datastores with leaderless replication (see “Leaderless Replication”) work much more on a “best effort” basis, which could be summarized as “the database will do as much as it can, and if it runs into an error, it won’t undo something it has already done”—so it’s the application’s responsibility to recover from errors.

Errors will inevitably happen, but many software developers prefer to think only about the happy path rather than the intricacies of error handling. For example, popular object-relational mapping (ORM) frameworks such as Rails’s ActiveRecord and Django don’t retry aborted transactions—the error usually results in an exception bubbling up the stack, so any user input is thrown away and the user gets an error message. This is a shame, because the whole point of aborts is to enable safe retries.

Although retrying an aborted transaction is a simple and effective error handling mechanism, it isn’t perfect:

  • If the transaction actually succeeded, but the network failed while the server tried to acknowledge the successful commit to the client (so the client thinks it failed), then retrying the transaction causes it to be performed twice—unless you have an additional application-level deduplication mechanism in place.

  • If the error is due to overload, retrying the transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit the number of retries, use exponential backoff, and handle overload-related errors differently from other errors (if possible).

  • It is only worth retrying after transient errors (for example due to deadlock, isolation violation, temporary network interruptions, and failover); after a permanent error (e.g., constraint violation) a retry would be pointless.

  • If the transaction also has side effects outside of the database, those side effects may happen even if the transaction is aborted. For example, if you’re sending an email, you wouldn’t want to send the email again every time you retry the transaction. If you want to make sure that several different systems either commit or abort together, two-phase commit can help (we will discuss this in “Atomic Commit and Two-Phase Commit (2PC)”).

  • If the client process fails while retrying, any data it was trying to write to the database is lost.
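A retry wrapper that follows the advice above — bounded attempts, exponential backoff, and retrying only transient errors — could look like this sketch (the exception class and delays are illustrative):

```python
import time

class TransientError(Exception):
    """Stands in for deadlocks, temporary network interruptions, failover."""

def run_with_retries(transaction_fn, max_retries=5, base_delay=0.01):
    """Retry only transient errors, with exponential backoff and a bound
    on the number of attempts; permanent errors propagate immediately."""
    for attempt in range(max_retries):
        try:
            return transaction_fn()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # 10 ms, 20 ms, 40 ms, ...

# Demo: a transaction that hits a simulated deadlock twice, then commits.
attempts = {"count": 0}

def flaky_transaction():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated deadlock")
    return "committed"

result = run_with_retries(flaky_transaction)
```

Note that this does nothing about the duplicate-execution problem from the first bullet; deduplication has to be handled at the application level.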

Weak Isolation Levels

If two transactions don’t touch the same data, they can safely be run in parallel, because neither depends on the other. Concurrency issues (race conditions) only come into play when one transaction reads data that is concurrently modified by another transaction, or when two transactions try to simultaneously modify the same data.

Concurrency bugs are hard to find by testing, because such bugs are only triggered when you get unlucky with the timing. Such timing issues might occur very rarely, and are usually difficult to reproduce. Concurrency is also very difficult to reason about, especially in a large application where you don’t necessarily know which other pieces of code are accessing the database. Application development is difficult enough if you just have one user at a time; having many concurrent users makes it much harder still, because any piece of data could unexpectedly change at any time.

For that reason, databases have long tried to hide concurrency issues from application developers by providing transaction isolation. In theory, isolation should make your life easier by letting you pretend that no concurrency is happening: serializable isolation means that the database guarantees that transactions have the same effect as if they ran serially (i.e., one at a time, without any concurrency).

In practice, isolation is unfortunately not that simple. Serializable isolation has a performance cost, and many databases don’t want to pay that price [8]. It’s therefore common for systems to use weaker levels of isolation, which protect against some concurrency issues, but not all. Those levels of isolation are much harder to understand, and they can lead to subtle bugs, but they are nevertheless used in practice [23].

Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have caused substantial loss of money [24, 25], led to investigation by financial auditors [26], and caused customer data to be corrupted [27]. A popular comment on revelations of such problems is “Use an ACID database if you’re handling financial data!”—but that misses the point. Even many popular relational database systems (which are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these bugs from occurring.

Rather than blindly relying on tools, we need to develop a good understanding of the kinds of concurrency problems that exist, and how to prevent them. Then we can build applications that are reliable and correct, using the tools at our disposal.

In this section we will look at several weak (nonserializable) isolation levels that are used in practice, and discuss in detail what kinds of race conditions can and cannot occur, so that you can decide what level is appropriate to your application. Once we’ve done that, we will discuss serializability in detail (see “Serializability”). Our discussion of isolation levels will be informal, using examples. If you want rigorous definitions and analyses of their properties, you can find them in the academic literature [28, 29, 30].

Read Committed

The most basic level of transaction isolation is read committed. It makes two guarantees:

  1. When reading from the database, you will only see data that has been committed (no dirty reads).

  2. When writing to the database, you will only overwrite data that has been committed (no dirty writes).

Let’s discuss these two guarantees in more detail.

No dirty reads

Imagine a transaction has written some data to the database, but the transaction has not yet committed or aborted. Can another transaction see that uncommitted data? If yes, that is called a dirty read [2].

Transactions running at the read committed isolation level must prevent dirty reads. This means that any writes by a transaction only become visible to others when that transaction commits (and then all of its writes become visible at once). This is illustrated in Figure 7-4, where user 1 has set x = 3, but user 2’s get x still returns the old value, 2, while user 1 has not yet committed.

Figure 7-4. No dirty reads: user 2 sees the new value for x only after user 1’s transaction has committed.
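The scenario of Figure 7-4 can be reproduced with two `sqlite3` connections to the same database file. SQLite in WAL journal mode lets readers see only the last committed snapshot, which is enough to demonstrate "no dirty reads" (the file path and key names are illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path, isolation_level=None)
reader = sqlite3.connect(path, isolation_level=None)
writer.execute("PRAGMA journal_mode=WAL")  # readers see the last committed snapshot
writer.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value INTEGER)")
writer.execute("INSERT INTO kv VALUES ('x', 2)")

# User 1 sets x = 3 inside a transaction but has not committed yet.
writer.execute("BEGIN")
writer.execute("UPDATE kv SET value = 3 WHERE key = 'x'")

# User 2 still reads the old committed value: no dirty read.
value_before_commit = reader.execute(
    "SELECT value FROM kv WHERE key = 'x'").fetchall()[0][0]

writer.execute("COMMIT")  # now all of user 1's writes become visible at once
value_after_commit = reader.execute(
    "SELECT value FROM kv WHERE key = 'x'").fetchall()[0][0]
```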

There are a few reasons why it’s useful to prevent dirty reads:

  • If a transaction needs to update several objects, a dirty read means that another transaction may see some of the updates but not others. For example, in Figure 7-2, the user sees the new unread email but not the updated counter. This is a dirty read of the email. Seeing the database in a partially updated state is confusing to users and may cause other transactions to take incorrect decisions.

  • If a transaction aborts, any writes it has made need to be rolled back (like in Figure 7-3). If the database allows dirty reads, that means a transaction may see data that is later rolled back—i.e., which is never actually committed to the database. Reasoning about the consequences quickly becomes mind-bending.

No dirty writes

What happens if two transactions concurrently try to update the same object in a database? We don’t know in which order the writes will happen, but we normally assume that the later write overwrites the earlier write.

However, what happens if the earlier write is part of a transaction that has not yet committed, so the later write overwrites an uncommitted value? This is called a dirty write [28]. Transactions running at the read committed isolation level must prevent dirty writes, usually by delaying the second write until the first write’s transaction has committed or aborted.

By preventing dirty writes, this isolation level avoids some kinds of concurrency problems:

  • If transactions update multiple objects, dirty writes can lead to a bad outcome. For example, consider Figure 7-5, which illustrates a used car sales website on which two people, Alice and Bob, are simultaneously trying to buy the same car. Buying a car requires two database writes: the listing on the website needs to be updated to reflect the buyer, and the sales invoice needs to be sent to the buyer. In the case of Figure 7-5, the sale is awarded to Bob (because he performs the winning update to the listings table), but the invoice is sent to Alice (because she performs the winning update to the invoices table). Read committed prevents such mishaps.

  • 然而,读已提交并不能阻止图 7-1中两个计数器增量之间的竞争条件 。在这种情况下,第二次写入发生在第一个事务提交之后,因此它不是脏写入。它仍然是不正确的,但出于不同的原因——在 “防止丢失更新”中,我们将讨论如何使此类计数器增量安全。

  • However, read committed does not prevent the race condition between two counter increments in Figure 7-1. In this case, the second write happens after the first transaction has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason—in “Preventing Lost Updates” we will discuss how to make such counter increments safe.

图 7-5。对于脏写入,来自不同事务的冲突写入可能会混合在一起。

Figure 7-5. With dirty writes, conflicting writes from different transactions can get mixed up.

实施已提交读

Implementing read committed

已提交读是一种非常流行的隔离级别。它是 Oracle 11g、PostgreSQL、SQL Server 2012、MemSQL 和许多其他数据库中的默认设置 [ 8 ]。

Read committed is a very popular isolation level. It is the default setting in Oracle 11g, PostgreSQL, SQL Server 2012, MemSQL, and many other databases [8].

最常见的是,数据库通过使用行级锁来防止脏写:当事务想要修改特定对象(行或文档)时,它必须首先获取该对象的锁。然后它必须持有该锁,直到事务被提交或中止。对于任何给定的对象,只有一个事务可以持有锁;如果另一个事务想要写入同一个对象,它必须等到第一个事务提交或中止后才能获取锁并继续。这种锁定是由数据库在已提交读模式(或更强大的隔离级别)下自动完成的。

Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to modify a particular object (row or document), it must first acquire a lock on that object. It must then hold that lock until the transaction is committed or aborted. Only one transaction can hold the lock for any given object; if another transaction wants to write to the same object, it must wait until the first transaction is committed or aborted before it can acquire the lock and continue. This locking is done automatically by databases in read committed mode (or stronger isolation levels).

我们如何防止脏读呢?一种选择是使用相同的锁,并要求任何想要读取对象的事务短暂获取该锁,然后在读取后立即再次释放它。这将确保当对象具有脏的、未提交的值时不会发生读取(因为在此期间锁将由进行写入的事务持有)。

How do we prevent dirty reads? One option would be to use the same lock, and to require any transaction that wants to read an object to briefly acquire the lock and then release it again immediately after reading. This would ensure that a read couldn’t happen while an object has a dirty, uncommitted value (because during that time the lock would be held by the transaction that has made the write).

然而,要求读锁的方法在实践中效果不佳,因为一个长时间运行的写事务可能会迫使许多只读事务等待,直到长时间运行的事务完成。这会损害只读事务的响应时间,并且不利于可操作性:由于等待锁,应用程序某一部分的速度减慢可能会对应用程序的完全不同部分产生连锁反应。

However, the approach of requiring read locks does not work well in practice, because one long-running write transaction can force many read-only transactions to wait until the long-running transaction has completed. This harms the response time of read-only transactions and is bad for operability: a slowdown in one part of an application can have a knock-on effect in a completely different part of the application, due to waiting for locks.

因此,大多数数据库使用图 7-4 所示的方法来防止脏读:对于每个写入的对象,数据库会记住旧的已提交值和当前持有写锁的事务所设置的新值。当事务正在进行时,任何读取该对象的其他事务都只会得到旧值。只有在新值提交后,事务才会切换为读取新值。

For that reason, most databases prevent dirty reads using the approach illustrated in Figure 7-4: for every object that is written, the database remembers both the old committed value and the new value set by the transaction that currently holds the write lock. While the transaction is ongoing, any other transactions that read the object are simply given the old value. Only when the new value is committed do transactions switch over to reading the new value.
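这一机制可以用一个简化的 Python 草图说明(假设性玩具示例,`ReadCommittedStore` 类及其字段均为本文虚构;真实数据库在存储引擎内部完成这些工作):写入者持有对象级写锁并暂存新值,读取者始终只能看到最后提交的值。

The mechanism above can be sketched in Python (a hypothetical toy, assuming the invented `ReadCommittedStore` class; real databases do this inside the storage engine): writers take a per-object lock and stage the new value, while readers keep seeing the last committed value.

```python
import threading

class ReadCommittedStore:
    """Toy key-value store sketching read committed isolation:
    writers hold a per-object lock and stage a new value; readers
    are always given the last committed value (no dirty reads)."""

    def __init__(self):
        self.committed = {}  # key -> last committed value
        self.staged = {}     # txid -> {key: uncommitted new value}
        self.locks = {}      # key -> txid currently holding the write lock
        self.mutex = threading.Lock()

    def write(self, txid, key, value):
        with self.mutex:
            holder = self.locks.get(key)
            if holder is not None and holder != txid:
                raise RuntimeError("would block: lock held by another transaction")
            self.locks[key] = txid                      # row-level write lock
            self.staged.setdefault(txid, {})[key] = value

    def read(self, key):
        with self.mutex:
            return self.committed.get(key)              # old value while a write is in progress

    def commit(self, txid):
        with self.mutex:
            for key, value in self.staged.pop(txid, {}).items():
                self.committed[key] = value             # readers switch to the new value
                del self.locks[key]

store = ReadCommittedStore()
store.write(txid=1, key="x", value="new")
assert store.read("x") is None   # the uncommitted write is invisible
store.commit(1)
assert store.read("x") == "new"  # visible only after commit
```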

快照隔离和可重复读取

Snapshot Isolation and Repeatable Read

如果只从表面上看读已提交隔离,您可能会认为它已经做了事务需要做的所有事情:它允许中止(原子性所需),它防止读取事务的不完整结果,并且它防止并发写入混杂在一起。事实上,这些都是有用的功能,比没有事务的系统所能提供的保证要强得多。

If you look superficially at read committed isolation, you could be forgiven for thinking that it does everything that a transaction needs to do: it allows aborts (required for atomicity), it prevents reading the incomplete results of transactions, and it prevents concurrent writes from getting intermingled. Indeed, those are useful features, and much stronger guarantees than you can get from a system that has no transactions.

但是,在使用此隔离级别时,仍然有很多方式可能会出现并发错误。例如,图 7-6说明了已提交读时可能出现的问题。

However, there are still plenty of ways in which you can have concurrency bugs when using this isolation level. For example, Figure 7-6 illustrates a problem that can occur with read committed.

图 7-6。读倾斜:Alice 观察到数据库处于不一致状态。

Figure 7-6. Read skew: Alice observes the database in an inconsistent state.

假设 Alice 在银行有 1,000 美元的存款,分为两个账户,每个账户 500 美元。现在,一笔交易将 100 美元从她的一个账户转到另一个账户。如果她不幸在该交易处理过程中查看账户余额列表,她可能会在转入的款项到达之前看到其中一个账户的余额(余额为 500 美元),而在转出已经完成之后看到另一个账户的余额(新余额为 400 美元)。对 Alice 来说,现在她的账户里似乎总共只有 900 美元——100 美元仿佛凭空消失了。

Say Alice has $1,000 of savings at a bank, split across two accounts with $500 each. Now a transaction transfers $100 from one of her accounts to the other. If she is unlucky enough to look at her list of account balances in the same moment as that transaction is being processed, she may see one account balance at a time before the incoming payment has arrived (with a balance of $500), and the other account after the outgoing transfer has been made (the new balance being $400). To Alice it now appears as though she only has a total of $900 in her accounts—it seems that $100 has vanished into thin air.

这种异常称为不可重复读(nonrepeatable read)或读偏差(read skew):如果 Alice 在事务结束时再次读取账户 1 的余额,她将看到与之前查询中不同的值(600 美元)。在读已提交隔离下,读偏差被认为是可以接受的:Alice 读取账户余额时,她所看到的余额确实已经提交。

This anomaly is called a nonrepeatable read or read skew: if Alice were to read the balance of account 1 again at the end of the transaction, she would see a different value ($600) than she saw in her previous query. Read skew is considered acceptable under read committed isolation: the account balances that Alice saw were indeed committed at the time when she read them.

笔记

不幸的是,倾斜(skew)这个词被重载了:我们之前用它来表示具有热点的不均衡工作负载(请参阅“倾斜的工作负载和缓解热点”),而这里它的意思是时序异常。

The term skew is unfortunately overloaded: we previously used it in the sense of an unbalanced workload with hot spots (see “Skewed Workloads and Relieving Hot Spots”), whereas here it means timing anomaly.

对于爱丽丝来说,这不是一个持久的问题,因为如果她几秒钟后重新加载网上银行网站,她很可能会看到一致的账户余额。然而,有些情况不能容忍这种暂时的不一致:

In Alice’s case, this is not a lasting problem, because she will most likely see consistent account balances if she reloads the online banking website a few seconds later. However, some situations cannot tolerate such temporary inconsistency:

备份
Backups

进行备份需要制作整个数据库的副本,这对于大型数据库可能需要数小时。在备份过程运行期间,将继续对数据库进行写入。因此,您最终可能会得到备份的某些部分包含较旧版本的数据,而其他部分则包含较新版本的数据。如果您需要从这样的备份进行恢复,不一致的情况(例如消失的钱)将成为永久性的。

Taking a backup requires making a copy of the entire database, which may take hours on a large database. During the time that the backup process is running, writes will continue to be made to the database. Thus, you could end up with some parts of the backup containing an older version of the data, and other parts containing a newer version. If you need to restore from such a backup, the inconsistencies (such as disappearing money) become permanent.

分析查询和完整性检查
Analytic queries and integrity checks

有时,您可能想要运行一个扫描数据库大部分内容的查询。此类查询在分析中很常见(请参阅“事务处理还是分析?”),或者可能是定期完整性检查的一部分,以确保一切正常(监控数据损坏)。如果这些查询在不同时间点观察数据库的某些部分,则可能会返回无意义的结果。

Sometimes, you may want to run a query that scans over large parts of the database. Such queries are common in analytics (see “Transaction Processing or Analytics?”), or may be part of a periodic integrity check that everything is in order (monitoring for data corruption). These queries are likely to return nonsensical results if they observe parts of the database at different points in time.

快照隔离[ 28 ]是这个问题最常见的解决方案。这个想法是,每个事务都从数据库的一致快照中读取,也就是说,事务会看到在事务开始时数据库中提交的所有数据。即使数据随后被另一个事务更改,每个事务也只能看到该特定时间点的旧数据。

Snapshot isolation [28] is the most common solution to this problem. The idea is that each transaction reads from a consistent snapshot of the database—that is, the transaction sees all the data that was committed in the database at the start of the transaction. Even if the data is subsequently changed by another transaction, each transaction sees only the old data from that particular point in time.

快照隔离对于长时间运行的只读查询(例如备份和分析)来说是一个福音。如果查询所操作的数据在查询执行的同时发生变化,则很难推断查询的含义。当事务可以看到数据库的一致快照(冻结在特定时间点)时,就更容易理解。

Snapshot isolation is a boon for long-running, read-only queries such as backups and analytics. It is very hard to reason about the meaning of a query if the data on which it operates is changing at the same time as the query is executing. When a transaction can see a consistent snapshot of the database, frozen at a particular point in time, it is much easier to understand.

快照隔离是一个流行的功能:PostgreSQL、使用 InnoDB 存储引擎的 MySQL、Oracle、SQL Server 等都支持它 [23, 31, 32]。

Snapshot isolation is a popular feature: it is supported by PostgreSQL, MySQL with the InnoDB storage engine, Oracle, SQL Server, and others [23, 31, 32].

实施快照隔离

Implementing snapshot isolation

与读提交隔离一样,快照隔离的实现通常使用写锁来防止脏写(请参阅“实现读提交”),这意味着进行写操作的事务可以阻止写入同一对象的另一个事务的进度。但是,读取不需要任何锁。从性能角度来看,快照隔离的一个关键原则是读取器永远不会阻塞写入器,写入器也永远不会阻塞读取器。这允许数据库在正常处理写入的同时处理一致快照上的长时间运行的读取查询,而两者之间不会出现任何锁争用。

Like read committed isolation, implementations of snapshot isolation typically use write locks to prevent dirty writes (see “Implementing read committed”), which means that a transaction that makes a write can block the progress of another transaction that writes to the same object. However, reads do not require any locks. From a performance point of view, a key principle of snapshot isolation is readers never block writers, and writers never block readers. This allows a database to handle long-running read queries on a consistent snapshot at the same time as processing writes normally, without any lock contention between the two.

为了实现快照隔离,数据库使用了我们在图 7-4 中看到的防止脏读的机制的概括。数据库必须潜在地保留对象的多个不同提交版本,因为各种正在进行的事务可能需要查看数据库在不同时间点的状态。由于它并排维护对象的多个版本,因此该技术称为多版本并发控制(MVCC)。

To implement snapshot isolation, databases use a generalization of the mechanism we saw for preventing dirty reads in Figure 7-4. The database must potentially keep several different committed versions of an object, because various in-progress transactions may need to see the state of the database at different points in time. Because it maintains several versions of an object side by side, this technique is known as multi-version concurrency control (MVCC).

如果数据库只需要提供读已提交隔离,而不需要快照隔离,那么为每个对象保留两个版本就足够了:已提交的版本和被覆盖但尚未提交的版本。然而,支持快照隔离的存储引擎通常也使用 MVCC 来实现其读已提交隔离级别。典型的方法是:读已提交对每个查询使用单独的快照,而快照隔离对整个事务使用同一个快照。

If a database only needed to provide read committed isolation, but not snapshot isolation, it would be sufficient to keep two versions of an object: the committed version and the overwritten-but-not-yet-committed version. However, storage engines that support snapshot isolation typically use MVCC for their read committed isolation level as well. A typical approach is that read committed uses a separate snapshot for each query, while snapshot isolation uses the same snapshot for an entire transaction.

图 7-7 展示了 PostgreSQL 中基于 MVCC 的快照隔离是如何实现的 [31](其他实现类似)。当一个事务开始时,它会被赋予一个唯一的、单调递增的事务 ID(txid)。每当事务向数据库写入任何内容时,它写入的数据都会被打上写入者的事务 ID 标记。

Figure 7-7 illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL [31] (other implementations are similar). When a transaction is started, it is given a unique, always-increasing transaction ID (txid). Whenever a transaction writes anything to the database, the data it writes is tagged with the transaction ID of the writer.

图 7-7。使用多版本对象实现快照隔离。

Figure 7-7. Implementing snapshot isolation using multi-version objects.

表中的每一行都有一个 created_by 字段,包含将该行插入表中的事务 ID。此外,每一行还有一个 deleted_by 字段,该字段最初为空。如果事务删除了某一行,该行实际上并不会从数据库中删除,而是通过将 deleted_by 字段设置为请求删除的事务 ID 来将其标记为删除。稍后,当可以确定不再有事务会访问已删除的数据时,数据库中的垃圾收集过程会删除所有标记为删除的行并释放其空间。

Each row in a table has a created_by field, containing the ID of the transaction that inserted this row into the table. Moreover, each row has a deleted_by field, which is initially empty. If a transaction deletes a row, the row isn’t actually deleted from the database, but it is marked for deletion by setting the deleted_by field to the ID of the transaction that requested the deletion. At some later time, when it is certain that no transaction can any longer access the deleted data, a garbage collection process in the database removes any rows marked for deletion and frees their space.

更新在内部被转换为一次删除和一次创建。例如,在图 7-7 中,事务 13 从账户 2 中扣除 100 美元,将余额从 500 美元改为 400 美元。现在,accounts 表中实际上包含账户 2 的两行:一行余额为 500 美元,被事务 13 标记为已删除;另一行余额为 400 美元,由事务 13 创建。

An update is internally translated into a delete and a create. For example, in Figure 7-7, transaction 13 deducts $100 from account 2, changing the balance from $500 to $400. The accounts table now actually contains two rows for account 2: a row with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of $400 which was created by transaction 13.
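下面的 Python 片段大致重现了图 7-7 中事务 13 的更新(示意性草图,行的字典结构为本文虚构):更新被表示为把旧版本标记为删除,再追加一个新版本。

The update in Figure 7-7 can be replayed with a small sketch (the dict-based row layout is invented for illustration): an update marks the old version as deleted and appends a new version.

```python
# Each row version carries created_by / deleted_by transaction IDs.
rows = [
    {"account": 1, "balance": 500, "created_by": 3, "deleted_by": None},
    {"account": 2, "balance": 500, "created_by": 3, "deleted_by": None},
]

def update(rows, txid, account, new_balance):
    """An update is a delete of the current version plus a create."""
    for row in rows:
        if row["account"] == account and row["deleted_by"] is None:
            row["deleted_by"] = txid  # mark the old version as deleted
    rows.append({"account": account, "balance": new_balance,
                 "created_by": txid, "deleted_by": None})

update(rows, txid=13, account=2, new_balance=400)  # deduct $100 from account 2

versions = [r for r in rows if r["account"] == 2]
assert versions == [
    {"account": 2, "balance": 500, "created_by": 3, "deleted_by": 13},
    {"account": 2, "balance": 400, "created_by": 13, "deleted_by": None},
]
```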

用于观察一致快照的可见性规则

Visibility rules for observing a consistent snapshot

当事务从数据库读取时,事务 ID 用于决定哪些对象可以看到,哪些对象不可见。通过仔细定义可见性规则,数据库可以向应用程序呈现数据库的一致快照。其工作原理如下:

When a transaction reads from the database, transaction IDs are used to decide which objects it can see and which are invisible. By carefully defining visibility rules, the database can present a consistent snapshot of the database to the application. This works as follows:

  1. 在每个事务开始时,数据库会列出当时正在进行的(尚未提交或中止的)所有其他事务。即使这些事务随后提交,它们所做的任何写入都会被忽略。

  1. At the start of each transaction, the database makes a list of all the other transactions that are in progress (not yet committed or aborted) at that time. Any writes that those transactions have made are ignored, even if the transactions subsequently commit.

  2. 中止事务所做的任何写入都将被忽略。

  2. Any writes made by aborted transactions are ignored.

  3. 具有较晚事务 ID 的事务(即在当前事务开始之后才开始的事务)所做的任何写入都将被忽略,无论这些事务是否已提交。

  3. Any writes made by transactions with a later transaction ID (i.e., which started after the current transaction started) are ignored, regardless of whether those transactions have committed.

  4. 所有其他写入对应用程序的查询都是可见的。

  4. All other writes are visible to the application’s queries.

这些规则同时适用于对象的创建和删除。在图 7-7 中,当事务 12 从账户 2 读取数据时,它会看到 500 美元的余额,因为 500 美元余额的删除是由事务 13 执行的(根据规则 3,事务 12 看不到事务 13 执行的删除),而 400 美元余额的创建也尚不可见(根据同一规则)。

These rules apply to both creation and deletion of objects. In Figure 7-7, when transaction 12 reads from account 2, it sees a balance of $500 because the deletion of the $500 balance was made by transaction 13 (according to rule 3, transaction 12 cannot see a deletion made by transaction 13), and the creation of the $400 balance is not yet visible (by the same rule).

换句话说,如果满足以下两个条件,则对象可见:

Put another way, an object is visible if both of the following conditions are true:

  • 当读者的事务开始时,创建该对象的事务已经提交。

  • At the time when the reader’s transaction started, the transaction that created the object had already committed.

  • 该对象未标记为删除,或者如果是,则请求删除的事务在读取器的事务开始时尚未提交。

  • The object is not marked for deletion, or if it is, the transaction that requested deletion had not yet committed at the time when the reader’s transaction started.
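上述两个条件可以直接写成一个可见性判断函数(简化草图:假设事务 ID 在事务开始时分配,并用 in_progress / aborted 集合近似“快照开始时尚未提交”的判断):

The two conditions above translate directly into a visibility predicate (a simplified sketch: it assumes txids are assigned at transaction start, and approximates "not yet committed when the snapshot was taken" with in_progress / aborted sets):

```python
def visible(row, snapshot):
    """Return True if this row version is visible to the reader's snapshot."""
    def committed_in_snapshot(txid):
        # Committed before the reader's transaction started: not in progress,
        # not aborted, and started earlier (lower txid) than the reader.
        return (txid not in snapshot["in_progress"]
                and txid not in snapshot["aborted"]
                and txid < snapshot["txid"])

    if not committed_in_snapshot(row["created_by"]):
        return False   # condition 1: the creator must already be committed
    if row["deleted_by"] is not None and committed_in_snapshot(row["deleted_by"]):
        return False   # condition 2: a committed deletion hides the row
    return True

# Transaction 12's snapshot in Figure 7-7 (transaction 13 started later):
snapshot = {"txid": 12, "in_progress": set(), "aborted": set()}
old = {"account": 2, "balance": 500, "created_by": 3, "deleted_by": 13}
new = {"account": 2, "balance": 400, "created_by": 13, "deleted_by": None}
assert visible(old, snapshot)       # deletion by txid 13 is ignored
assert not visible(new, snapshot)   # creation by txid 13 is ignored
```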

长时间运行的事务可能会继续长时间使用快照,继续读取(从其他事务的角度来看)早已被覆盖或删除的值。通过从不更新值,而是在每次更改值时创建一个新版本,数据库可以提供一致的快照,同时只产生很小的开销。

A long-running transaction may continue using a snapshot for a long time, continuing to read values that (from other transactions’ point of view) have long been overwritten or deleted. By never updating values in place but instead creating a new version every time a value is changed, the database can provide a consistent snapshot while incurring only a small overhead.

索引和快照隔离

Indexes and snapshot isolation

索引如何在多版本数据库中工作?一种选择是让索引简单地指向对象的所有版本,并需要索引查询来过滤掉当前事务不可见的任何对象版本。当垃圾收集删除对任何事务不再可见的旧对象版本时,相应的索引条目也可以被删除。

How do indexes work in a multi-version database? One option is to have the index simply point to all versions of an object and require an index query to filter out any object versions that are not visible to the current transaction. When garbage collection removes old object versions that are no longer visible to any transaction, the corresponding index entries can also be removed.

在实际应用中,很多实现细节决定了多版本并发控制的性能。例如,如果同一对象的不同版本可以容纳在同一页面上,则 PostgreSQL 会进行优化以避免索引更新 [ 31 ]。

In practice, many implementation details determine the performance of multi-version concurrency control. For example, PostgreSQL has optimizations for avoiding index updates if different versions of the same object can fit on the same page [31].

CouchDB、Datomic 和 LMDB 使用了另一种方法。虽然它们也使用 B 树(请参阅“B 树”),但它们使用的是仅追加/写时复制的变体:在更新时不覆盖树的页面,而是为每个被修改的页面创建一个新副本。从被修改页面向上直到树根的各级父页面都会被复制,并更新为指向其子页面的新版本。任何不受写入影响的页面都不需要复制,保持不可变 [33, 34, 35]。

Another approach is used in CouchDB, Datomic, and LMDB. Although they also use B-trees (see “B-Trees”), they use an append-only/copy-on-write variant that does not overwrite pages of the tree when they are updated, but instead creates a new copy of each modified page. Parent pages, up to the root of the tree, are copied and updated to point to the new versions of their child pages. Any pages that are not affected by a write do not need to be copied, and remain immutable [33, 34, 35].

对于仅追加的 B 树,每个写入事务(或事务批次)都会创建一个新的 B 树根,而某个特定的根就是数据库在其创建时刻的一致快照。不需要根据事务 ID 过滤对象,因为后续写入无法修改现有的 B 树:它们只能创建新的树根。然而,这种方法也需要一个负责压缩和垃圾收集的后台进程。

With append-only B-trees, every write transaction (or batch of transactions) creates a new B-tree root, and a particular root is a consistent snapshot of the database at the point in time when it was created. There is no need to filter out objects based on transaction IDs because subsequent writes cannot modify an existing B-tree; they can only create new tree roots. However, this approach also requires a background process for compaction and garbage collection.

可重复读取和命名混乱

Repeatable read and naming confusion

快照隔离是一种有用的隔离级别,特别是对于只读事务。然而,许多实现它的数据库用不同的名称来称呼它。在 Oracle 中,它被称为可序列化,在 PostgreSQL 和 MySQL 中,它被称为可重复读取 [ 23 ]。

Snapshot isolation is a useful isolation level, especially for read-only transactions. However, many databases that implement it call it by different names. In Oracle it is called serializable, and in PostgreSQL and MySQL it is called repeatable read [23].

这种命名混乱的原因是 SQL 标准没有快照隔离的概念,因为该标准基于 System R 1975 年的隔离级别定义 [2],当时快照隔离尚未发明。相反,它定义了可重复读取,表面上看起来类似于快照隔离。PostgreSQL和MySQL将它们的快照隔离级别称为可重复读,因为它满足标准的要求,因此它们可以声称符合标准。

The reason for this naming confusion is that the SQL standard doesn’t have the concept of snapshot isolation, because the standard is based on System R’s 1975 definition of isolation levels [2] and snapshot isolation hadn’t yet been invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot isolation. PostgreSQL and MySQL call their snapshot isolation level repeatable read because it meets the requirements of the standard, and so they can claim standards compliance.

不幸的是,SQL 标准对隔离级别的定义是有缺陷的——它含糊不清、不精确,并且不像标准应有的那样独立于实现[ 28 ]。尽管一些数据库实现了可重复读取,但它们实际上提供的保证存在很大差异,尽管表面上是标准化的[ 23 ]。研究文献 [ 29 , 30 ]中对可重复读取有一个正式的定义,但大多数实现并不满足该正式定义。最重要的是,IBM DB2 使用“可重复读取”来指代可串行性 [ 8 ]。

Unfortunately, the SQL standard’s definition of isolation levels is flawed—it is ambiguous, imprecise, and not as implementation-independent as a standard should be [28]. Even though several databases implement repeatable read, there are big differences in the guarantees they actually provide, despite being ostensibly standardized [23]. There has been a formal definition of repeatable read in the research literature [29, 30], but most implementations don’t satisfy that formal definition. And to top it off, IBM DB2 uses “repeatable read” to refer to serializability [8].

结果,没有人真正知道可重复读取意味着什么。

As a result, nobody really knows what repeatable read means.

防止丢失更新

Preventing Lost Updates

到目前为止,我们讨论的读已提交和快照隔离级别主要是关于保证只读事务在并发写入的情况下可以看到的内容。我们大多忽略了两个事务同时写入的问题——我们只讨论了脏写(参见 “无脏写”),这是一种可能发生的特殊类型的写-写冲突。

The read committed and snapshot isolation levels we’ve discussed so far have been primarily about the guarantees of what a read-only transaction can see in the presence of concurrent writes. We have mostly ignored the issue of two transactions writing concurrently—we have only discussed dirty writes (see “No dirty writes”), one particular type of write-write conflict that can occur.

并发写入事务之间还可能发生其他几种有趣的冲突。其中最著名的是丢失更新问题, 如图 7-1中两个并发计数器增量的示例所示。

There are several other interesting kinds of conflicts that can occur between concurrently writing transactions. The best known of these is the lost update problem, illustrated in Figure 7-1 with the example of two concurrent counter increments.

如果应用程序从数据库读取某些值,修改它,然后写回修改后的值(读-修改-写循环),则可能会发生丢失更新问题。如果两个事务同时执行此操作,则其中一个修改可能会丢失,因为第二次写入不包括第一次修改。(我们有时会说后面的写入会破坏前面的写入。)这种模式出现在各种不同的场景中:

The lost update problem can occur if an application reads some value from the database, modifies it, and writes back the modified value (a read-modify-write cycle). If two transactions do this concurrently, one of the modifications can be lost, because the second write does not include the first modification. (We sometimes say that the later write clobbers the earlier write.) This pattern occurs in various different scenarios:

  • 增加计数器或更新帐户余额(需要读取当前值、计算新值并写回更新后的值)

  • Incrementing a counter or updating an account balance (requires reading the current value, calculating the new value, and writing back the updated value)

  • 对复杂值进行本地更改,例如,将元素添加到 JSON 文档内的列表(需要解析文档、进行更改并写回修改后的文档)

  • Making a local change to a complex value, e.g., adding an element to a list within a JSON document (requires parsing the document, making the change, and writing back the modified document)

  • 两个用户同时编辑 wiki 页面,每个用户通过将整个页面内容发送到服务器来保存更改,覆盖数据库中当前的所有内容

  • Two users editing a wiki page at the same time, where each user saves their changes by sending the entire page contents to the server, overwriting whatever is currently in the database

由于这是一个常见问题,因此开发了多种解决方案。

Because this is such a common problem, a variety of solutions have been developed.

原子写操作

Atomic write operations

许多数据库提供原子更新操作,从而无需在应用程序代码中实现读取-修改-写入循环。如果您的代码可以用这些操作来表达,那么它们通常是最好的解决方案。例如,以下指令在大多数关系数据库中是并发安全的:

Many databases provide atomic update operations, which remove the need to implement read-modify-write cycles in application code. They are usually the best solution if your code can be expressed in terms of those operations. For example, the following instruction is concurrency-safe in most relational databases:

UPDATE counters SET value = value + 1 WHERE key = 'foo';
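可以用一个小实验说明为什么这样的单条 UPDATE 语句是并发安全的(示意性草图,这里用 Python 标准库的 sqlite3 充当示例数据库;多个线程各自递增同一计数器,没有任何增量丢失):

A quick experiment illustrates why this single UPDATE is concurrency-safe (a sketch using Python's stdlib sqlite3 as a stand-in database; several threads each increment the same counter and no increment is lost):

```python
import os
import sqlite3
import tempfile
import threading

db_path = os.path.join(tempfile.mkdtemp(), "counters.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('foo', 0)")
conn.commit()

def increment(n):
    # Each thread uses its own connection; the UPDATE reads and writes
    # the value in one atomic statement, so there is no read-modify-write race.
    local = sqlite3.connect(db_path, timeout=30)
    for _ in range(n):
        local.execute("UPDATE counters SET value = value + 1 WHERE key = 'foo'")
        local.commit()
    local.close()

threads = [threading.Thread(target=increment, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

(value,) = conn.execute("SELECT value FROM counters WHERE key = 'foo'").fetchone()
assert value == 400  # 4 threads x 100 increments, none lost
```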

类似地,MongoDB 等文档数据库提供了对 JSON 文档的一部分进行局部修改的原子操作,Redis 提供了修改优先级队列等数据结构的原子操作。并非所有写入都能轻松地用原子操作来表达——例如,对 wiki 页面的更新涉及任意的文本编辑——但在可以使用原子操作的情况下,它们通常是最佳选择。

Similarly, document databases such as MongoDB provide atomic operations for making local modifications to a part of a JSON document, and Redis provides atomic operations for modifying data structures such as priority queues. Not all writes can easily be expressed in terms of atomic operations—for example, updates to a wiki page involve arbitrary text editing—but in situations where atomic operations can be used, they are usually the best choice.

原子操作通常是通过在读取对象时对其进行独占锁定来实现的,以便在应用更新之前没有其他事务可以读取它。这种技术有时被称为光标稳定性[ 36 , 37 ]。另一种选择是简单地强制所有原子操作在单个线程上执行。

Atomic operations are usually implemented by taking an exclusive lock on the object when it is read so that no other transaction can read it until the update has been applied. This technique is sometimes known as cursor stability [36, 37]. Another option is to simply force all atomic operations to be executed on a single thread.

不幸的是,使用对象关系映射框架时,很容易在无意中写出执行不安全的读-修改-写循环的代码,而不是使用数据库提供的原子操作 [38]。如果您清楚自己在做什么,这不是问题,但它可能成为难以通过测试发现的微妙错误的来源。

Unfortunately, object-relational mapping frameworks make it easy to accidentally write code that performs unsafe read-modify-write cycles instead of using atomic operations provided by the database [38]. That’s not a problem if you know what you are doing, but it is potentially a source of subtle bugs that are difficult to find by testing.

显式锁定

Explicit locking

如果数据库的内置原子操作不提供必要的功能,则防止丢失更新的另一个选项是应用程序显式锁定要更新的对象。然后,应用程序可以执行读取-修改-写入周期,并且如果任何其他事务尝试同时读取同一对象,则它被迫等待,直到第一个读取-修改-写入周期完成。

Another option for preventing lost updates, if the database’s built-in atomic operations don’t provide the necessary functionality, is for the application to explicitly lock objects that are going to be updated. Then the application can perform a read-modify-write cycle, and if any other transaction tries to concurrently read the same object, it is forced to wait until the first read-modify-write cycle has completed.

例如,考虑一个多人游戏,其中多个玩家可以同时移动同一个人物。在这种情况下,原子操作可能还不够,因为应用程序还需要确保玩家的移动遵守游戏规则,这涉及到一些您无法明智地实现为数据库查询的逻辑。相反,您可以使用锁来防止两个玩家同时移动同一个棋子,如示例 7-1所示。

For example, consider a multiplayer game in which several players can move the same figure concurrently. In this case, an atomic operation may not be sufficient, because the application also needs to ensure that a player’s move abides by the rules of the game, which involves some logic that you cannot sensibly implement as a database query. Instead, you may use a lock to prevent two players from concurrently moving the same piece, as illustrated in Example 7-1.

例 7-1。显式锁定行以防止丢失更新

Example 7-1. Explicitly locking rows to prevent lost updates

BEGIN TRANSACTION;

SELECT * FROM figures
  WHERE name = 'robot' AND game_id = 222
  FOR UPDATE; 1

-- Check whether move is valid, then update the position
-- of the piece that was returned by the previous SELECT.
UPDATE figures SET position = 'c4' WHERE id = 1234;

COMMIT;

1. FOR UPDATE 子句指示数据库锁定该查询返回的所有行。

1. The FOR UPDATE clause indicates that the database should take a lock on all rows returned by this query.

这是可行的,但为了做到正确,您需要仔细考虑您的应用程序逻辑。人们很容易忘记在代码中的某处添加必要的锁,从而引入竞争条件。

This works, but to get it right, you need to carefully think about your application logic. It’s easy to forget to add a necessary lock somewhere in the code, and thus introduce a race condition.

自动检测丢失的更新

Automatically detecting lost updates

原子操作和锁是通过强制读取-修改-写入周期按顺序发生来防止丢失更新的方法。另一种方法是允许它们并行执行,如果事务管理器检测到丢失的更新,则中止事务并强制其重试读取-修改-写入周期。

Atomic operations and locks are ways of preventing lost updates by forcing the read-modify-write cycles to happen sequentially. An alternative is to allow them to execute in parallel and, if the transaction manager detects a lost update, abort the transaction and force it to retry its read-modify-write cycle.

这种方法的优点是数据库可以结合快照隔离有效地执行此检查。事实上,PostgreSQL 的可重复读取、Oracle 的可序列化和 SQL Server 的快照隔离级别会自动检测何时发生丢失更新并中止有问题的事务。然而,MySQL/InnoDB 的可重复读取无法检测丢失的更新[ 23 ]。一些作者 [ 28 , 30 ] 认为数据库必须防止丢失更新才能有资格提供快照隔离,因此 MySQL 在此定义下不提供快照隔离。

An advantage of this approach is that databases can perform this check efficiently in conjunction with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates [23]. Some authors [28, 30] argue that a database must prevent lost updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot isolation under this definition.

丢失更新检测是一个很棒的功能,因为它不需要应用程序代码使用任何特殊的数据库功能——您可能会忘记使用锁或原子操作从而引入错误,但丢失更新检测是自动发生的,因此更不容易出错。

Lost update detection is a great feature, because it doesn’t require application code to use any special database features—you may forget to use a lock or an atomic operation and thus introduce a bug, but lost update detection happens automatically and is thus less error-prone.

比较并设置

Compare-and-set

在不提供事务的数据库中,有时您会发现原子比较和设置操作(之前在“单对象写入”中提到过)。此操作的目的是通过仅当该值自上次读取以来未更改时才允许进行更新,从而避免丢失更新。如果当前值与您之前读取的值不匹配,则更新无效,并且必须重试读取-修改-写入周期。

In databases that don’t provide transactions, you sometimes find an atomic compare-and-set operation (previously mentioned in “Single-object writes”). The purpose of this operation is to avoid lost updates by allowing an update to happen only if the value has not changed since you last read it. If the current value does not match what you previously read, the update has no effect, and the read-modify-write cycle must be retried.

例如,为了防止两个用户同时更新同一个 wiki 页面,您可以尝试类似的操作,期望仅当页面内容自用户开始编辑以来未更改时才会发生更新:

For example, to prevent two users concurrently updating the same wiki page, you might try something like this, expecting the update to occur only if the content of the page hasn’t changed since the user started editing it:

-- This may or may not be safe, depending on the database implementation
UPDATE wiki_pages SET content = 'new content'
  WHERE id = 1234 AND content = 'old content';

如果内容已经发生变化,不再匹配 'old content',则此更新将不起作用,因此您需要检查更新是否生效,并在必要时重试。但是,如果数据库允许 WHERE 子句从旧快照中读取,则此语句可能无法防止丢失更新,因为即使另一个并发写入正在发生,条件也可能为真。在依赖数据库的比较并设置操作之前,请先检查它是否安全。

If the content has changed and no longer matches 'old content', this update will have no effect, so you need to check whether the update took effect and retry if necessary. However, if the database allows the WHERE clause to read from an old snapshot, this statement may not prevent lost updates, because the condition may be true even though another concurrent write is occurring. Check whether your database’s compare-and-set operation is safe before relying on it.
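下面是一个比较并设置的最小草图(假设性示例,沿用正文中的 wiki_pages 表,用 sqlite3 演示;通过受影响的行数判断更新是否生效):

A minimal compare-and-set sketch (hypothetical, reusing the wiki_pages table from the text and sqlite3 for illustration; the affected row count tells us whether the update took effect):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wiki_pages (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO wiki_pages VALUES (1234, 'old content')")
conn.commit()

def compare_and_set(conn, page_id, expected, new_content):
    """Update only if the content still matches what we previously read."""
    cur = conn.execute(
        "UPDATE wiki_pages SET content = ? WHERE id = ? AND content = ?",
        (new_content, page_id, expected))
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed => stale read, caller must retry

assert compare_and_set(conn, 1234, "old content", "new content")     # first editor wins
assert not compare_and_set(conn, 1234, "old content", "other edit")  # second editor must retry
(content,) = conn.execute("SELECT content FROM wiki_pages WHERE id = 1234").fetchone()
assert content == "new content"
```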

冲突解决和复制

Conflict resolution and replication

在复制数据库中(参见第 5 章),防止丢失更新涉及另一个层面:由于它们在多个节点上都有数据副本,并且数据可能会在不同节点上同时修改,因此需要采取一些额外的步骤来防止丢失更新。

In replicated databases (see Chapter 5), preventing lost updates takes on another dimension: since they have copies of the data on multiple nodes, and the data can potentially be modified concurrently on different nodes, some additional steps need to be taken to prevent lost updates.

锁定和比较并设置操作假定存在单个最新的数据副本。然而,具有多领导者或无领导者复制的数据库通常允许同时发生多个写入并异步复制它们,因此它们不能保证存在单个最新的数据副本。因此,基于锁或比较和设置的技术不适用于这种情况。(我们将在“线性化”中更详细地讨论这个问题。)

Locks and compare-and-set operations assume that there is a single up-to-date copy of the data. However, databases with multi-leader or leaderless replication usually allow several writes to happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a single up-to-date copy of the data. Thus, techniques based on locks or compare-and-set do not apply in this context. (We will revisit this issue in more detail in “Linearizability”.)

相反,正如“检测并发写入”中所讨论的,这类复制数据库中的常见做法是允许并发写入创建一个值的多个冲突版本(也称为兄弟版本),并使用应用程序代码或特殊的数据结构在事后解决冲突并合并这些版本。

Instead, as discussed in “Detecting Concurrent Writes”, a common approach in such replicated databases is to allow concurrent writes to create several conflicting versions of a value (also known as siblings), and to use application code or special data structures to resolve and merge these versions after the fact.

原子操作可以在复制上下文中很好地工作,特别是如果它们是可交换的(即,您可以在不同的副本上以不同的顺序应用它们,但仍然得到相同的结果)。例如,递增计数器或向集合添加元素都是可交换操作。这就是 Riak 2.0 数据类型背后的想法,它可以防止跨副本丢失更新。当不同客户端同时更新一个值时,Riak 会自动将更新合并在一起,这样不会丢失任何更新 [ 39 ]。

Atomic operations can work well in a replicated context, especially if they are commutative (i.e., you can apply them in a different order on different replicas, and still get the same result). For example, incrementing a counter or adding an element to a set are commutative operations. That is the idea behind Riak 2.0 datatypes, which prevent lost updates across replicas. When a value is concurrently updated by different clients, Riak automatically merges together the updates in such a way that no updates are lost [39].
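可交换合并的思路可以用一个极简的“按副本计数”的计数器来说明(示意性草图,风格上接近 Riak 的数据类型,但并非其真实实现):每个副本只递增自己的槽位,合并时取每个槽位的最大值,因此合并顺序无关紧要,增量也不会丢失。

The commutative-merge idea can be shown with a minimal per-replica counter (a sketch in the spirit of Riak's datatypes, not their actual implementation): each replica increments only its own slot, and merging takes the per-slot maximum, so merge order doesn't matter and no increment is lost.

```python
def increment(counter, replica, amount=1):
    """Return a new counter with this replica's slot incremented."""
    counter = dict(counter)
    counter[replica] = counter.get(replica, 0) + amount
    return counter

def merge(a, b):
    """Merging takes the maximum of each replica's slot (commutative)."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

a = increment({}, "replica_a")     # client 1 increments on replica A
b = increment({}, "replica_b")     # client 2 increments concurrently on replica B
assert merge(a, b) == merge(b, a)  # order of merging doesn't matter
assert value(merge(a, b)) == 2     # both increments survive
```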

另一方面,最后写入获胜(LWW)冲突解决方法很容易丢失更新,如“最后写入获胜(丢弃并发写入)”中所述。不幸的是,LWW 是许多复制数据库中的默认设置。

On the other hand, the last write wins (LWW) conflict resolution method is prone to lost updates, as discussed in “Last write wins (discarding concurrent writes)”. Unfortunately, LWW is the default in many replicated databases.

写入偏差和幻象

Write Skew and Phantoms

在前面的部分中,我们看到了脏写丢失更新,这是当不同事务同时尝试写入同一对象时可能发生的两种竞争条件。为了避免数据损坏,需要防止这些竞争条件——要么由数据库自动,要么通过手动保护措施(例如使用锁或原子写入操作)。

In the previous sections we saw dirty writes and lost updates, two kinds of race conditions that can occur when different transactions concurrently try to write to the same objects. In order to avoid data corruption, those race conditions need to be prevented—either automatically by the database, or by manual safeguards such as using locks or atomic write operations.

然而,这并不是并发写入之间可能发生的潜在竞争条件列表的结尾。在本节中,我们将看到一些更微妙的冲突示例。

However, that is not the end of the list of potential race conditions that can occur between concurrent writes. In this section we will see some subtler examples of conflicts.

首先,想象一下这个例子:您正在为医生编写一个应用程序来管理他们在医院的值班轮班。医院通常会尝试同时安排多名医生待命,但绝对必须至少有一名待命医生。医生可以放弃轮班(例如,如果他们自己生病了),前提是至少有一名同事在该轮班中待命 [ 40 , 41 ]。

To begin, imagine this example: you are writing an application for doctors to manage their on-call shifts at a hospital. The hospital usually tries to have several doctors on call at any one time, but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if they are sick themselves), provided that at least one colleague remains on call in that shift [40, 41].

现在想象一下,爱丽丝和鲍勃是特定轮班的两位值班医生。两人都感觉身体不适,因此决定请假。不幸的是,他们恰好在大约同一时间点击了取消值班的按钮。接下来发生的情况如图 7-8 所示。

Now imagine that Alice and Bob are the two on-call doctors for a particular shift. Both are feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button to go off call at approximately the same time. What happens next is illustrated in Figure 7-8.

图 7-8。写入偏差导致应用程序错误的示例。

Figure 7-8. Example of write skew causing an application bug.

在每个事务中,您的应用程序首先检查当前是否有两名或更多医生值班;如果是,则认为一名医生取消值班是安全的。由于数据库使用快照隔离,两次检查都返回 2,因此两个事务都进入下一阶段。Alice 更新了自己的记录以取消值班,Bob 也同样更新了自己的记录。两个事务均成功提交,但现在没有医生值班了。至少有一名医生值班的要求已被违反。

In each transaction, your application first checks that two or more doctors are currently on call; if yes, it assumes it’s safe for one doctor to go off call. Since the database is using snapshot isolation, both checks return 2, so both transactions proceed to the next stage. Alice updates her own record to take herself off call, and Bob updates his own record likewise. Both transactions commit, and now no doctor is on call. Your requirement of having at least one doctor on call has been violated.
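
The race can be reproduced in a few lines. In this sketch (the data layout and helper are illustrative), each transaction checks the precondition against its own consistent snapshot, and because each write touches only that doctor's own record, no conflict is detected:

```python
# Sketch of the on-call write skew under snapshot isolation. Both
# transactions read from snapshots taken before either commits.

on_call = {"alice": True, "bob": True}

def go_off_call(name, snapshot, store):
    # Step 1: check the precondition against the snapshot.
    currently_on_call = sum(snapshot.values())
    if currently_on_call >= 2:
        # Step 2: the write only touches this doctor's own record,
        # so no write-write conflict is detected.
        store[name] = False
        return True
    return False

# Both transactions take their snapshot first, then both commit.
snap_alice = dict(on_call)
snap_bob = dict(on_call)
assert go_off_call("alice", snap_alice, on_call)
assert go_off_call("bob", snap_bob, on_call)

# The invariant "at least one doctor on call" is now violated.
assert sum(on_call.values()) == 0
```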

表征写入偏差

Characterizing write skew

这种异常称为写偏斜[ 28 ]。它既不是脏写,也不是丢失更新,因为这两个事务正在更新两个不同的对象(分别是 Alice 和 Bob 的 on-call 记录)。这里发生的冲突不太明显,但这绝对是一种竞争条件:如果两个事务相继运行,那么第二位医生将无法下班。异常行为之所以可能发生,是因为事务是同时运行的。

This anomaly is called write skew [28]. It is neither a dirty write nor a lost update, because the two transactions are updating two different objects (Alice’s and Bob’s on-call records, respectively). It is less obvious that a conflict occurred here, but it’s definitely a race condition: if the two transactions had run one after another, the second doctor would have been prevented from going off call. The anomalous behavior was only possible because the transactions ran concurrently.

您可以将写入偏差视为丢失更新问题的概括。如果两个事务读取相同的对象,然后更新其中一些对象(不同的事务可能更新不同的对象),则可能会出现写入偏差。在不同事务更新同一对象的特殊情况下,您会遇到脏写或丢失更新异常(取决于时间)。

You can think of write skew as a generalization of the lost update problem. Write skew can occur if two transactions read the same objects, and then update some of those objects (different transactions may update different objects). In the special case where different transactions update the same object, you get a dirty write or lost update anomaly (depending on the timing).

我们看到有多种不同的方法可以防止丢失更新。对于写入偏差,我们的选择受到更多限制:

We saw that there are various different ways of preventing lost updates. With write skew, our options are more restricted:

  • 原子单对象操作没有帮助,因为涉及多个对象。

  • Atomic single-object operations don’t help, as multiple objects are involved.

  • 不幸的是,某些快照隔离实现中的丢失更新自动检测也无济于事:在 PostgreSQL 的可重复读、MySQL/InnoDB 的可重复读、Oracle 的可序列化或 SQL Server 的快照隔离级别中,都不会自动检测写入偏差 [ 23 ]。自动防止写入偏差需要真正的可序列化隔离(请参阅“可串行化”)。

  • The automatic detection of lost updates that you find in some implementations of snapshot isolation unfortunately doesn’t help either: write skew is not automatically detected in PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL Server’s snapshot isolation level [23]. Automatically preventing write skew requires true serializable isolation (see “Serializability”).

  • 某些数据库允许您配置约束,然后由数据库强制执行这些约束(例如,唯一性、外键约束或对特定值的限制)。但是,为了指定至少一名医生必须待命,您需要一个涉及多个对象的约束。大多数数据库没有对此类约束的内置支持,但您可以使用触发器或物化视图来实现它们,具体取决于数据库[ 42 ]。

  • Some databases allow you to configure constraints, which are then enforced by the database (e.g., uniqueness, foreign key constraints, or restrictions on a particular value). However, in order to specify that at least one doctor must be on call, you would need a constraint that involves multiple objects. Most databases do not have built-in support for such constraints, but you may be able to implement them with triggers or materialized views, depending on the database [42].

  • 如果无法使用可序列化的隔离级别,那么在这种情况下,第二好的选择可能是显式锁定事务所依赖的行。在医生的示例中,您可以编写如下内容:

    BEGIN TRANSACTION;
    
    SELECT * FROM doctors
      WHERE on_call = true
      AND shift_id = 1234 FOR UPDATE;
    
    UPDATE doctors
      SET on_call = false
      WHERE name = 'Alice'
      AND shift_id = 1234;
    
    COMMIT;

    和以前一样,FOR UPDATE告诉数据库锁定该查询返回的所有行。

  • If you can’t use a serializable isolation level, the second-best option in this case is probably to explicitly lock the rows that the transaction depends on. In the doctors example, you could write something like the following:

    BEGIN TRANSACTION;
    
    SELECT * FROM doctors
      WHERE on_call = true
      AND shift_id = 1234 FOR UPDATE; 
    
    UPDATE doctors
      SET on_call = false
      WHERE name = 'Alice'
      AND shift_id = 1234;
    
    COMMIT;

    As before, FOR UPDATE tells the database to lock all rows returned by this query.

更多写入倾斜示例

More examples of write skew

写入倾斜乍一看似乎是一个深奥的问题,但是一旦您意识到它,您可能会注意到更多可能发生这种情况的情况。这里还有一些例子:

Write skew may seem like an esoteric issue at first, but once you’re aware of it, you may notice more situations in which it can occur. Here are some more examples:

会议室预订系统
Meeting room booking system

假设您想强制同一会议室不能同时有两个预订 [ 43 ]。当有人想要预订时,您首先检查是否有任何冲突的预订(即同一房间的预订时间范围有重叠),如果没有找到,则创建会议(请参见示例 7-2)。

例7-2。会议室预订系统试图避免重复预订(在快照隔离下不安全)
BEGIN TRANSACTION;

-- Check for any existing bookings that overlap with the period of noon-1pm
SELECT COUNT(*) FROM bookings
  WHERE room_id = 123 AND
    end_time > '2015-01-01 12:00' AND start_time < '2015-01-01 13:00';

-- If the previous query returned zero:
INSERT INTO bookings
  (room_id, start_time, end_time, user_id)
  VALUES (123, '2015-01-01 12:00', '2015-01-01 13:00', 666);

COMMIT;

不幸的是,快照隔离并不能阻止其他用户同时插入冲突的会议。为了保证不会出现调度冲突,您再次需要可序列化隔离。

Say you want to enforce that there cannot be two bookings for the same meeting room at the same time [43]. When someone wants to make a booking, you first check for any conflicting bookings (i.e., bookings for the same room with an overlapping time range), and if none are found, you create the meeting (see Example 7-2).

Example 7-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)
BEGIN TRANSACTION;

-- Check for any existing bookings that overlap with the period of noon-1pm
SELECT COUNT(*) FROM bookings
  WHERE room_id = 123 AND
    end_time > '2015-01-01 12:00' AND start_time < '2015-01-01 13:00';

-- If the previous query returned zero:
INSERT INTO bookings
  (room_id, start_time, end_time, user_id)
  VALUES (123, '2015-01-01 12:00', '2015-01-01 13:00', 666);

COMMIT;

Unfortunately, snapshot isolation does not prevent another user from concurrently inserting a conflicting meeting. In order to guarantee you won’t get scheduling conflicts, you once again need serializable isolation.
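
The failure mode is the same write-skew pattern as before, but with an insert instead of an update. In this sketch (the booking representation is illustrative), both transactions run the overlap check of Example 7-2 against empty snapshots and both insert:

```python
# Sketch of the phantom in Example 7-2: both transactions query their
# snapshot for overlapping bookings, find none, and insert — ending up
# with a double booking. Times are simplified to integer hours.

def overlaps(booking, room_id, start, end):
    return (booking["room_id"] == room_id
            and booking["end"] > start
            and booking["start"] < end)

def book(snapshot, store, room_id, start, end, user):
    # The SELECT COUNT(*) of Example 7-2, run against the snapshot.
    conflicts = [b for b in snapshot if overlaps(b, room_id, start, end)]
    if not conflicts:
        store.append({"room_id": room_id, "start": start, "end": end, "user": user})

bookings = []
snap1, snap2 = list(bookings), list(bookings)  # both snapshots are empty
book(snap1, bookings, 123, 12, 13, "alice")
book(snap2, bookings, 123, 12, 13, "bob")

# Neither transaction saw the other's insert: the room is double-booked.
assert len(bookings) == 2
```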

多人游戏
Multiplayer game

在示例 7-1 中,我们使用锁来防止丢失更新(即确保两个玩家不能同时移动同一个棋子)。然而,锁并不能阻止玩家将两个不同的棋子移动到棋盘上的同一位置,或者做出其他违反游戏规则的动作。根据您要执行的规则类型,您也许能够使用唯一性约束,否则您很容易受到写入偏差的影响。

In Example 7-1, we used a lock to prevent lost updates (that is, making sure that two players can’t move the same figure at the same time). However, the lock doesn’t prevent players from moving two different figures to the same position on the board or potentially making some other move that violates the rules of the game. Depending on the kind of rule you are enforcing, you might be able to use a unique constraint, but otherwise you’re vulnerable to write skew.

索取用户名
Claiming a username

在每个用户都有唯一用户名的网站上,两个用户可能会尝试同时使用相同的用户名创建帐户。您可以使用事务来检查某个名称是否已被占用,如果没有,则使用该名称创建一个帐户。但是,与前面的示例一样,这在快照隔离下并不安全。幸运的是,唯一性约束在这里是一个简单的解决方案(第二个尝试注册该用户名的事务将因违反约束而中止)。

On a website where each user has a unique username, two users may try to create accounts with the same username at the same time. You may use a transaction to check whether a name is taken and, if not, create an account with that name. However, like in the previous examples, that is not safe under snapshot isolation. Fortunately, a unique constraint is a simple solution here (the second transaction that tries to register the username will be aborted due to violating the constraint).

防止双重支出
Preventing double-spending

允许用户花钱或积分的服务需要检查用户的支出是否超过其拥有的金额。您可以通过将暂定支出项目插入用户的帐户、列出帐户中的所有项目并检查总和是否为正来实现此目的 [ 44 ]。在写入偏差的情况下,可能会同时插入两个支出项目,从而导致余额变为负值,但两个事务都不会注意到另一个事务。

A service that allows users to spend money or points needs to check that a user doesn’t spend more than they have. You might implement this by inserting a tentative spending item into a user’s account, listing all the items in the account, and checking that the sum is positive [44]. With write skew, it could happen that two spending items are inserted concurrently that together cause the balance to go negative, but that neither transaction notices the other.

导致写入倾斜的幻象

Phantoms causing write skew

所有这些示例都遵循类似的模式:

All of these examples follow a similar pattern:

  1. 一个 SELECT 查询通过搜索与某些搜索条件匹配的行来检查是否满足某些要求(至少有两名医生值班、该时段该房间没有现有预订、棋盘上的该位置还没有其他棋子、用户名尚未被占用、帐户中仍有余额)。

  1. A SELECT query checks whether some requirement is satisfied by searching for rows that match some search condition (there are at least two doctors on call, there are no existing bookings for that room at that time, the position on the board doesn’t already have another figure on it, the username isn’t already taken, there is still money in the account).

  2. 根据第一个查询的结果,应用程序代码决定如何继续(可能继续操作,或者可能向用户报告错误并中止)。

  2. Depending on the result of the first query, the application code decides how to continue (perhaps to go ahead with the operation, or perhaps to report an error to the user and abort).

  3. 如果应用程序决定继续,它将向数据库执行写入(INSERT、UPDATE 或 DELETE)并提交事务。

     此写入的效果会改变步骤 2 中决策的前提条件。换句话说,如果在提交写入之后重复步骤 1 中的 SELECT 查询,您会得到不同的结果,因为写入更改了与搜索条件匹配的行集(现在值班的医生少了一名、会议室在该时段已被预订、棋盘上的该位置已被刚移动的棋子占据、用户名已被占用、帐户中的余额变少了)。

  3. If the application decides to go ahead, it makes a write (INSERT, UPDATE, or DELETE) to the database and commits the transaction.

     The effect of this write changes the precondition of the decision of step 2. In other words, if you were to repeat the SELECT query from step 1 after committing the write, you would get a different result, because the write changed the set of rows matching the search condition (there is now one fewer doctor on call, the meeting room is now booked for that time, the position on the board is now taken by the figure that was moved, the username is now taken, there is now less money in the account).

这些步骤可能以不同的顺序发生。例如,您可以先进行写入,然后进行 SELECT查询,最后根据查询结果决定是否中止或提交。

The steps may occur in a different order. For example, you could first make the write, then the SELECT query, and finally decide whether to abort or commit based on the result of the query.

在医生值班的示例中,步骤 3 中修改的行是步骤 1 返回的行之一,因此我们可以通过在步骤 1 中锁定这些行 (SELECT FOR UPDATE) 来使事务安全并避免写入偏差。但是,其他四个示例有所不同:它们检查的是不存在与某个搜索条件匹配的行,而写入会添加与该条件匹配的行。如果步骤 1 中的查询没有返回任何行,SELECT FOR UPDATE 就无法对任何内容加锁。

In the case of the doctor on call example, the row being modified in step 3 was one of the rows returned in step 1, so we could make the transaction safe and avoid write skew by locking the rows in step 1 (SELECT FOR UPDATE). However, the other four examples are different: they check for the absence of rows matching some search condition, and the write adds a row matching the same condition. If the query in step 1 doesn’t return any rows, SELECT FOR UPDATE can’t attach locks to anything.

这种效应(即一个事务中的写入更改了另一个事务中搜索查询的结果)称为幻象 [ 3 ]。快照隔离避免了只读查询中的幻象,但在像我们讨论的这些读写事务中,幻象可能导致特别棘手的写入偏差情况。

This effect, where a write in one transaction changes the result of a search query in another transaction, is called a phantom [3]. Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the examples we discussed, phantoms can lead to particularly tricky cases of write skew.

冲突具体化

Materializing conflicts

如果幻象的问题是没有我们可以附加锁的对象,也许我们可以人为地向数据库中引入一个锁对象?

If the problem of phantoms is that there is no object to which we can attach the locks, perhaps we can artificially introduce a lock object into the database?

例如,在会议室预订案例中,您可以想象创建一个时间段和房间表。该表中的每一行对应于特定时间段(例如 15 分钟)的特定房间。您可以提前为房间和时间段的所有可能组合创建行,例如接下来的六个月。

For example, in the meeting room booking case you could imagine creating a table of time slots and rooms. Each row in this table corresponds to a particular room for a particular time period (say, 15 minutes). You create rows for all possible combinations of rooms and time periods ahead of time, e.g. for the next six months.

现在,想要创建预订的事务可以锁定 ( SELECT FOR UPDATE) 表中与所需房间和时间段相对应的行。获得锁后,它可以检查重叠的预订并像以前一样插入新的预订。请注意,附加表不用于存储有关预订的信息 - 它纯粹是一组锁,用于防止同一房间和时间范围内的预订同时被修改。

Now a transaction that wants to create a booking can lock (SELECT FOR UPDATE) the rows in the table that correspond to the desired room and time period. After it has acquired the locks, it can check for overlapping bookings and insert a new booking as before. Note that the additional table isn’t used to store information about the booking—it’s purely a collection of locks which is used to prevent bookings on the same room and time range from being modified concurrently.

这种方法称为具体化冲突,因为它采用幻像并将其转换为数据库中存在的一组具体行上的锁定冲突[ 11 ]。不幸的是,弄清楚如何实现冲突可能很困难并且容易出错,而且让并发控制机制泄漏到应用程序数据模型中也是很丑陋的。出于这些原因,如果别无选择,那么实现冲突应被视为最后的手段。在大多数情况下,可串行化的隔离级别更为可取。

This approach is called materializing conflicts, because it takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database [11]. Unfortunately, it can be hard and error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control mechanism leak into the application data model. For those reasons, materializing conflicts should be considered a last resort if no alternative is possible. A serializable isolation level is much preferable in most cases.
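
The mechanism can be sketched with ordinary mutexes standing in for row locks (the slot table and booking layout here are illustrative assumptions, not a real database schema): pre-created (room, slot) entries give two conflicting bookings a concrete object to contend on.

```python
# Sketch of materializing conflicts: pre-created (room, slot) rows act
# as lock objects, so two bookings for the same slot contend on a
# concrete row instead of on a phantom.
import threading

# One lock per (room, 15-minute slot), created ahead of time.
slot_locks = {(123, slot): threading.Lock() for slot in range(96)}

bookings = []

def book(room_id, slot, user):
    lock = slot_locks[(room_id, slot)]  # the materialized conflict row
    with lock:  # plays the role of SELECT ... FOR UPDATE on the slot row
        if any(b == (room_id, slot) for b, _ in bookings):
            return False  # someone else already took the slot
        bookings.append(((room_id, slot), user))
        return True

assert book(123, 48, "alice") is True
assert book(123, 48, "bob") is False  # conflict detected, no double-booking
```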

可串行化

Serializability

在本章中,我们看到了几个容易出现竞争条件的事务示例。某些竞争条件可以通过已提交读隔离级别和快照隔离级别来防止,但其他条件则不能。我们遇到了一些特别棘手的例子,其中包括写入倾斜和幻象。这是一个可悲的情况:

In this chapter we have seen several examples of transactions that are prone to race conditions. Some race conditions are prevented by the read committed and snapshot isolation levels, but others are not. We encountered some particularly tricky examples with write skew and phantoms. It’s a sad situation:

  • 隔离级别很难理解,并且在不同的数据库中实现不一致(例如,“可重复读”的含义差异很大)。

  • Isolation levels are hard to understand, and inconsistently implemented in different databases (e.g., the meaning of “repeatable read” varies significantly).

  • 如果您查看应用程序代码,则很难判断在特定隔离级别上运行是否安全,尤其是在大型应用程序中,您可能不知道可能同时发生的所有事情。

  • If you look at your application code, it’s difficult to tell whether it is safe to run at a particular isolation level—especially in a large application, where you might not be aware of all the things that may be happening concurrently.

  • 没有好的工具可以帮助我们检测竞争条件。原则上,静态分析可能会有所帮助[ 26 ],但研究技术尚未进入实际应用。测试并发问题很困难,因为它们通常是不确定的——只有在时机不走运时才会出现问题。

  • There are no good tools to help us detect race conditions. In principle, static analysis may help [26], but research techniques have not yet found their way into practical use. Testing for concurrency issues is hard, because they are usually nondeterministic—problems only occur if you get unlucky with the timing.

这并不是一个新问题——自 20 世纪 70 年代首次引入弱隔离级别以来就一直如此 [ 2 ]。一直以来,研究人员的答案都很简单:使用可序列化隔离!

This is not a new problem—it has been like this since the 1970s, when weak isolation levels were first introduced [2]. All along, the answer from researchers has been simple: use serializable isolation!

可串行化隔离通常被认为是最强的隔离级别。它保证即使事务可以并行执行,最终结果也与它们一次一个地串行执行、没有任何并发时相同。因此,数据库保证,如果事务在单独运行时行为正确,那么它们在并发运行时仍然正确——换句话说,数据库可以防止所有可能的竞争条件。

Serializable isolation is usually regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency. Thus, the database guarantees that if the transactions behave correctly when run individually, they continue to be correct when run concurrently—in other words, the database prevents all possible race conditions.

但是,如果可序列化隔离比混乱的弱隔离级别好得多,那么为什么不是每个人都使用它呢?为了回答这个问题,我们需要看看实现可序列化的选项以及它们的执行方式。如今提供可序列化性的大多数数据库都使用三种技术之一,我们将在本章的其余部分中探讨这些技术:

But if serializable isolation is so much better than the mess of weak isolation levels, then why isn’t everyone using it? To answer this question, we need to look at the options for implementing serializability, and how they perform. Most databases that provide serializability today use one of three techniques, which we will explore in the rest of this chapter:

  • 按字面意思以串行顺序执行事务(请参阅“实际串行执行”)

  • Literally executing transactions in a serial order (see “Actual Serial Execution”)

  • 两阶段锁定(请参阅“两阶段锁定 (2PL)”),几十年来这是唯一可行的选择

  • Two-phase locking (see “Two-Phase Locking (2PL)”), which for several decades was the only viable option

  • 乐观并发控制技术,例如可串行化快照隔离(请参阅“可串行化快照隔离 (SSI)”)

  • Optimistic concurrency control techniques such as serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)”)

现在,我们将主要在单节点数据库的背景下讨论这些技术;在 第 9 章中,我们将研究如何将它们推广到涉及分布式系统中多个节点的事务。

For now, we will discuss these techniques primarily in the context of single-node databases; in Chapter 9 we will examine how they can be generalized to transactions that involve multiple nodes in a distributed system.

实际串行执行

Actual Serial Execution

避免并发问题的最简单方法是完全消除并发:在单个线程上按串行顺序一次仅执行一个事务。通过这样做,我们完全回避了检测和防止事务之间冲突的问题:根据定义,产生的隔离是可序列化的。

The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to execute only one transaction at a time, in serial order, on a single thread. By doing so, we completely sidestep the problem of detecting and preventing conflicts between transactions: the resulting isolation is by definition serializable.

尽管这似乎是一个显而易见的想法,但数据库设计者直到最近(大约 2007 年)才决定用于执行事务的单线程循环是可行的 [ 45 ]。如果在过去的 30 年里多线程并发被认为是获得良好性能的关键,那么是什么改变使得单线程执行成为可能呢?

Even though this seems like an obvious idea, database designers only fairly recently—around 2007—decided that a single-threaded loop for executing transactions was feasible [45]. If multi-threaded concurrency was considered essential for getting good performance during the previous 30 years, what changed to make single-threaded execution possible?

两个事态发展引起了这种重新思考:

Two developments caused this rethink:

  • RAM 变得足够便宜,对于许多用例来说,现在可以将整个活动数据集保留在内存中(请参阅“将所有内容保留在内存中”)。当事务需要访问的所有数据都在内存中时,事务的执行速度比必须等待从磁盘加载数据时快得多。

  • RAM became cheap enough that for many use cases it is now feasible to keep the entire active dataset in memory (see “Keeping everything in memory”). When all data that a transaction needs to access is in memory, transactions can execute much faster than if they have to wait for data to be loaded from disk.

  • 数据库设计者意识到 OLTP 事务通常很短,并且只进行少量的读取和写入(请参阅“事务处理还是分析?”)。相比之下,长时间运行的分析查询通常是只读的,因此它们可以在串行执行循环之外的一致快照上运行(使用快照隔离)。

  • Database designers realized that OLTP transactions are usually short and only make a small number of reads and writes (see “Transaction Processing or Analytics?”). By contrast, long-running analytic queries are typically read-only, so they can be run on a consistent snapshot (using snapshot isolation) outside of the serial execution loop.

串行执行事务的方法在 VoltDB/H-Store、Redis 和 Datomic 中均有实现 [ 46, 47, 48 ]。为单线程执行而设计的系统有时比支持并发的系统性能更好,因为它可以避免锁定的协调开销。然而,其吞吐量被限制在单个 CPU 核心的吞吐量。为了充分利用这个单线程,事务的结构需要与传统形式有所不同。

The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic [46, 47, 48]. A system designed for single-threaded execution can sometimes perform better than a system that supports concurrency, because it can avoid the coordination overhead of locking. However, its throughput is limited to that of a single CPU core. In order to make the most of that single thread, transactions need to be structured differently from their traditional form.

将事务封装在存储过程中

Encapsulating transactions in stored procedures

在数据库的早期,人们的设想是一个数据库事务可以涵盖完整的用户活动流程。例如,预订机票是一个多阶段过程(搜索航线、票价和可用座位;决定行程;预订行程中每个航班的座位;输入乘客详细信息;付款)。数据库设计者认为,如果整个过程是一个事务、从而可以原子地提交,那就太好了。

In the early days of databases, the intention was that a database transaction could encompass an entire flow of user activity. For example, booking an airline ticket is a multi-stage process (searching for routes, fares, and available seats; deciding on an itinerary; booking seats on each of the flights of the itinerary; entering passenger details; making payment). Database designers thought that it would be neat if that entire process was one transaction so that it could be committed atomically.

不幸的是,人类下定决心和做出反应的速度非常慢。如果数据库事务需要等待用户的输入,则数据库需要支持潜在的大量并发事务,其中大多数事务处于闲置状态。大多数数据库无法有效地做到这一点,因此几乎所有 OLTP 应用程序都通过避免在事务中交互等待用户来保持事务简短。在 Web 上,这意味着事务是在同一个 HTTP 请求内提交的,事务不会跨越多个请求。新的 HTTP 请求启动新的事务。

Unfortunately, humans are very slow to make up their minds and respond. If a database transaction needs to wait for input from a user, the database needs to support a potentially huge number of concurrent transactions, most of them idle. Most databases cannot do that efficiently, and so almost all OLTP applications keep transactions short by avoiding interactively waiting for a user within a transaction. On the web, this means that a transaction is committed within the same HTTP request—a transaction does not span multiple requests. A new HTTP request starts a new transaction.

即使人类已脱离关键路径,事务仍继续以交互式客户端/服务器方式执行,一次一个语句。应用程序进行查询,读取结果,也许根据第一个查询的结果进行另一个查询,等等。查询和结果在应用程序代码(在一台计算机上运行)和数据库服务器(在另一台计算机上运行)之间来回发送。

Even though the human has been taken out of the critical path, transactions have continued to be executed in an interactive client/server style, one statement at a time. An application makes a query, reads the result, perhaps makes another query depending on the result of the first query, and so on. The queries and results are sent back and forth between the application code (running on one machine) and the database server (on another machine).

在这种交互方式的事务中,大量时间花费在应用程序和数据库之间的网络通信上。如果您禁止数据库中的并发性并且一次只处理一个事务,那么吞吐量将非常糟糕,因为数据库将花费大部分时间等待应用程序为当前事务发出下一个查询。在这种数据库中,需要同时处理多个事务才能获得合理的性能。

In this interactive style of transaction, a lot of time is spent in network communication between the application and the database. If you were to disallow concurrency in the database and only process one transaction at a time, the throughput would be dreadful because the database would spend most of its time waiting for the application to issue the next query for the current transaction. In this kind of database, it’s necessary to process multiple transactions concurrently in order to get reasonable performance.

因此,具有单线程串行事务处理的系统不允许交互式多语句事务。相反,应用程序必须提前将整个事务代码作为存储过程提交到数据库。这些方法之间的差异如图 7-9所示。如果事务所需的所有数据都在内存中,则存储过程可以非常快地执行,而无需等待任何网络或磁盘 I/O。

For this reason, systems with single-threaded serial transaction processing don’t allow interactive multi-statement transactions. Instead, the application must submit the entire transaction code to the database ahead of time, as a stored procedure. The differences between these approaches is illustrated in Figure 7-9. Provided that all data required by a transaction is in memory, the stored procedure can execute very fast, without waiting for any network or disk I/O.
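
Putting the pieces together, here is a minimal sketch of stored-procedure-style serial execution (the queue-based loop and procedure names are illustrative assumptions, not VoltDB's actual API): the application submits the whole transaction up front, and one thread executes procedures one at a time against in-memory data, so the check-then-write of Figure 7-8 runs atomically.

```python
# Sketch of single-threaded serial execution of stored procedures.
import queue
import threading

db = {"alice_on_call": True, "bob_on_call": True}
tx_queue = queue.Queue()
results = []

def go_off_call(db, doctor):
    # The whole check-then-write runs as one serial step, so the
    # write skew of Figure 7-8 cannot occur.
    if sum(1 for v in db.values() if v) >= 2:
        db[doctor + "_on_call"] = False
        return True
    return False

def worker():
    while True:
        item = tx_queue.get()
        if item is None:
            break
        proc, args = item
        results.append(proc(db, *args))  # one transaction at a time

t = threading.Thread(target=worker)
t.start()
tx_queue.put((go_off_call, ("alice",)))
tx_queue.put((go_off_call, ("bob",)))
tx_queue.put(None)
t.join()

# Alice's request succeeds; Bob's is refused because only one doctor
# would remain on call.
assert results == [True, False]
```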

图 7-9。交互式事务和存储过程之间的区别(使用图 7-8 的示例事务)。

Figure 7-9. The difference between an interactive transaction and a stored procedure (using the example transaction of Figure 7-8).

存储过程的优缺点

Pros and cons of stored procedures

存储过程在关系数据库中已经存在了一段时间,并且自 1999 年起就成为 SQL 标准 (SQL/PSM) 的一部分。由于多种原因,它们的声誉有些不佳:

Stored procedures have existed for some time in relational databases, and they have been part of the SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, for various reasons:

  • 每个数据库供应商都有自己的存储过程语言(Oracle 有 PL/SQL、SQL Server 有 T-SQL、PostgreSQL 有 PL/pgSQL 等)。这些语言没有跟上通用编程语言的发展,因此从今天的角度来看,它们看起来相当丑陋和过时,并且它们缺乏大多数编程语言所具有的库生态系统。

  • Each database vendor has its own language for stored procedures (Oracle has PL/SQL, SQL Server has T-SQL, PostgreSQL has PL/pgSQL, etc.). These languages haven’t kept up with developments in general-purpose programming languages, so they look quite ugly and archaic from today’s point of view, and they lack the ecosystem of libraries that you find with most programming languages.

  • 在数据库中运行的代码很难管理:与应用程序服务器相比,它更难调试,更难以保持版本控制和部署,更难以测试,并且难以与指标收集系统集成以进行监控。

  • Code running in a database is difficult to manage: compared to an application server, it’s harder to debug, more awkward to keep in version control and deploy, trickier to test, and difficult to integrate with a metrics collection system for monitoring.

  • 数据库通常比应用程序服务器对性能更加敏感,因为单个数据库实例通常由许多应用程序服务器共享。数据库中编写不当的存储过程(例如,使用大量内存或CPU 时间)可能比应用程序服务器中同等编写不当的代码造成更多的麻烦。

  • A database is often much more performance-sensitive than an application server, because a single database instance is often shared by many application servers. A badly written stored procedure (e.g., using a lot of memory or CPU time) in a database can cause much more trouble than equivalent badly written code in an application server.

然而,这些问题是可以克服的。现代存储过程的实现已经放弃了 PL/SQL,转而使用现有的通用编程语言:VoltDB 使用 Java 或 Groovy,Datomic 使用 Java 或 Clojure,Redis 使用 Lua。

However, those issues can be overcome. Modern implementations of stored procedures have abandoned PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy, Datomic uses Java or Clojure, and Redis uses Lua.

借助存储过程和内存中数据,在单个线程上执行所有事务变得可行。由于它们不需要等待 I/O 并且避免了其他并发控制机制的开销,因此它们可以在单个线程上实现相当好的吞吐量。

With stored procedures and in-memory data, executing all transactions on a single thread becomes feasible. As they don’t need to wait for I/O and they avoid the overhead of other concurrency control mechanisms, they can achieve quite good throughput on a single thread.

VoltDB 还使用存储过程进行复制:它不是将事务的写入从一个节点复制到另一个节点,而是在每个副本上执行相同的存储过程。因此,VoltDB 要求存储过程是确定性的(当在不同节点上运行时,它们必须产生相同的结果)。例如,如果事务需要使用当前日期和时间,则必须通过特殊的确定性 API 来实现。

VoltDB also uses stored procedures for replication: instead of copying a transaction’s writes from one node to another, it executes the same stored procedure on each replica. VoltDB therefore requires that stored procedures are deterministic (when run on different nodes, they must produce the same result). If a transaction needs to use the current date and time, for example, it must do so through special deterministic APIs.

分区

Partitioning

串行执行所有事务使并发控制变得更加简单,但将数据库的事务吞吐量限制为单机上单个CPU核心的速度。只读事务可以使用快照隔离在其他地方执行,但对于具有高写入吞吐量的应用程序,单线程事务处理器可能成为严重的瓶颈。

Executing all transactions serially makes concurrency control much simpler, but limits the transaction throughput of the database to the speed of a single CPU core on a single machine. Read-only transactions may execute elsewhere, using snapshot isolation, but for applications with high write throughput, the single-threaded transaction processor can become a serious bottleneck.

为了扩展到多个 CPU 核心和多个节点,您可以对数据进行分区(请参阅第 6 章),这在 VoltDB 中受支持。如果您可以找到一种对数据集进行分区的方法,以便每个事务只需要在单个分区中读取和写入数据,那么每个分区都可以有自己的事务处理线程,独立于其他分区运行。在这种情况下,您可以为每个 CPU 核心分配自己的分区,这使您的事务吞吐量可以随着 CPU 核心的数量线性扩展[ 47 ]。

In order to scale to multiple CPU cores, and multiple nodes, you can potentially partition your data (see Chapter 6), which is supported in VoltDB. If you can find a way of partitioning your dataset so that each transaction only needs to read and write data within a single partition, then each partition can have its own transaction processing thread running independently from the others. In this case, you can give each CPU core its own partition, which allows your transaction throughput to scale linearly with the number of CPU cores [47].
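
The scheme can be sketched as one serial loop per partition, with transactions routed by key (the hash-based routing and helpers are illustrative assumptions): single-partition transactions never coordinate with each other.

```python
# Sketch of partitioned serial execution: each partition has its own
# data and its own single-threaded worker.
import queue
import threading

NUM_PARTITIONS = 4
partitions = [{} for _ in range(NUM_PARTITIONS)]
queues = [queue.Queue() for _ in range(NUM_PARTITIONS)]

def partition_of(key):
    return hash(key) % NUM_PARTITIONS  # illustrative routing function

def worker(pid):
    while True:
        tx = queues[pid].get()
        if tx is None:
            break
        tx(partitions[pid])  # serial within the partition

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_PARTITIONS)]
for t in threads:
    t.start()

def increment(key):
    def tx(store):
        store[key] = store.get(key, 0) + 1
    queues[partition_of(key)].put(tx)  # single-partition transaction

for _ in range(10):
    increment("counter")
for q in queues:
    q.put(None)
for t in threads:
    t.join()

assert partitions[partition_of("counter")]["counter"] == 10
```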

但是,对于任何需要访问多个分区的事务,数据库必须在其涉及的所有分区之间协调该事务。存储过程需要在所有分区上以锁步方式执行,以确保整个系统的可串行性。

However, for any transaction that needs to access multiple partitions, the database must coordinate the transaction across all the partitions that it touches. The stored procedure needs to be performed in lock-step across all partitions to ensure serializability across the whole system.

由于跨分区事务具有额外的协调开销,因此它们比单分区事务慢得多。VoltDB 报告的吞吐量约为每秒 1,000 次跨分区写入,这比单分区吞吐量低几个数量级,并且无法通过添加更多机器来增加 [49 ]

Since cross-partition transactions have additional coordination overhead, they are vastly slower than single-partition transactions. VoltDB reports a throughput of about 1,000 cross-partition writes per second, which is orders of magnitude below its single-partition throughput and cannot be increased by adding more machines [49].

事务是否可以是单分区很大程度上取决于应用程序所使用的数据的结构。简单的键值数据通常可以非常容易地进行分区,但是具有多个二级索引的数据可能需要大量的跨分区协调(请参阅 “分区和二级索引”)。

Whether transactions can be single-partition depends very much on the structure of the data used by the application. Simple key-value data can often be partitioned very easily, but data with multiple secondary indexes is likely to require a lot of cross-partition coordination (see “Partitioning and Secondary Indexes”).

串行执行总结

Summary of serial execution

事务的串行执行已成为在某些约束下实现可序列化隔离的可行方法:

Serial execution of transactions has become a viable way of achieving serializable isolation within certain constraints:

  • 每个事务都必须小而快,因为只要有一个缓慢的事务,就会拖慢所有事务的处理。

  • Every transaction must be small and fast, because it takes only one slow transaction to stall all transaction processing.

  • 它仅限于活动数据集可以放入内存的用例。很少访问的数据可以被移到磁盘上,但如果单线程事务需要访问这些数据,系统就会变得非常慢。

  • It is limited to use cases where the active dataset can fit in memory. Rarely accessed data could potentially be moved to disk, but if it needed to be accessed in a single-threaded transaction, the system would get very slow.

  • 写入吞吐量必须足够低,以便能够在单个 CPU 核心上处理,否则需要对事务进行分区,而不需要跨分区协调。

  • Write throughput must be low enough to be handled on a single CPU core, or else transactions need to be partitioned without requiring cross-partition coordination.

  • 跨分区事务是可能的,但它们的使用范围存在硬性限制。

  • Cross-partition transactions are possible, but there is a hard limit to the extent to which they can be used.

两阶段锁定 (2PL)

Two-Phase Locking (2PL)

大约 30 年来,数据库中只有一种广泛使用的可序列化算法: 两阶段锁定(2PL)。

For around 30 years, there was only one widely used algorithm for serializability in databases: two-phase locking (2PL).

2PL 不是 2PC

2PL is not 2PC

请注意,虽然两阶段锁定(2PL) 听起来与两阶段提交(2PC) 非常相似,但它们是完全不同的东西。我们将在第 9 章讨论 2PC 。

Note that while two-phase locking (2PL) sounds very similar to two-phase commit (2PC), they are completely different things. We will discuss 2PC in Chapter 9.

我们之前看到,锁经常用于防止脏写(请参阅 “无脏写”):如果两个事务同时尝试写入同一个对象,则锁确保第二个写入者必须等待,直到第一个写入完成其事务(中止或提交)在它可以继续之前。

We saw previously that locks are often used to prevent dirty writes (see “No dirty writes”): if two transactions concurrently try to write to the same object, the lock ensures that the second writer must wait until the first one has finished its transaction (aborted or committed) before it may continue.

两阶段锁定类似,但对锁定的要求更强。只要没有人写入同一对象,就允许多个事务同时读取该对象。但是一旦有人想要写入(修改或删除)一个对象,就需要独占访问:

Two-phase locking is similar, but makes the lock requirements much stronger. Several transactions are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as anyone wants to write (modify or delete) an object, exclusive access is required:

  • 如果事务 A 读取了一个对象,而事务 B 想要写入该对象,则 B 必须等到 A 提交或中止后才能继续。(这确保 B 不会在 A 背后意外地更改对象。)

  • If transaction A has read an object and transaction B wants to write to that object, B must wait until A commits or aborts before it can continue. (This ensures that B can’t change the object unexpectedly behind A’s back.)

  • 如果事务 A 写入了一个对象,而事务 B 想要读取该对象,则 B 必须等到 A 提交或中止后才能继续。(读取对象的旧版本,如图 7-1所示,在 2PL 下是不可接受的。)

  • If transaction A has written an object and transaction B wants to read that object, B must wait until A commits or aborts before it can continue. (Reading an old version of the object, like in Figure 7-1, is not acceptable under 2PL.)

在 2PL 中,写入者不仅会阻塞其他写入者,还会阻塞读取者,反之亦然。快照隔离的口号是读取者永远不会阻塞写入者,写入者也永远不会阻塞读取者(请参阅“实现快照隔离”),这句话抓住了快照隔离和两阶段锁定之间的关键区别。另一方面,由于 2PL 提供了可串行性,因此它可以防止前面讨论的所有竞争条件,包括丢失更新和写入偏差。

In 2PL, writers don’t just block other writers; they also block readers and vice versa. Snapshot isolation has the mantra readers never block writers, and writers never block readers (see “Implementing snapshot isolation”), which captures this key difference between snapshot isolation and two-phase locking. On the other hand, because 2PL provides serializability, it protects against all the race conditions discussed earlier, including lost updates and write skew.

两阶段锁定的实现

Implementation of two-phase locking

MySQL (InnoDB) 和 SQL Server 中的可序列化隔离级别以及 DB2 中的可重复读隔离级别使用 2PL [ 23 , 36 ]。

2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the repeatable read isolation level in DB2 [23, 36].

读取器和写入器的阻塞是通过对数据库中的每个对象加锁来实现的。锁可以处于共享模式独占模式。锁的使用方法如下:

The blocking of readers and writers is implemented by having a lock on each object in the database. The lock can either be in shared mode or in exclusive mode. The lock is used as follows:

  • 如果一个事务想要读取一个对象,它必须首先以共享模式获取锁。允许多个事务同时以共享模式持有锁,但如果另一个事务已经拥有该对象的排它锁,则这些事务必须等待。

  • If a transaction wants to read an object, it must first acquire the lock in shared mode. Several transactions are allowed to hold the lock in shared mode simultaneously, but if another transaction already has an exclusive lock on the object, these transactions must wait.

  • 如果一个事务想要写入一个对象,它必须首先以独占模式获取锁。没有其他事务可以同时持有该锁(无论是共享模式还是独占模式),因此如果该对象上存在任何现有锁,则该事务必须等待。

  • If a transaction wants to write to an object, it must first acquire the lock in exclusive mode. No other transaction may hold the lock at the same time (either in shared or in exclusive mode), so if there is any existing lock on the object, the transaction must wait.

  • 如果一个事务先读然后写一个对象,它可能会将其共享锁升级为排他锁。升级与直接获取独占锁的效果相同。

  • If a transaction first reads and then writes an object, it may upgrade its shared lock to an exclusive lock. The upgrade works the same as getting an exclusive lock directly.

  • 事务获取锁后,必须继续持有锁,直到事务结束(提交或中止)。这就是“两阶段”名称的由来:第一阶段(事务执行时)获取锁,第二阶段(事务结束时)释放所有锁。

  • After a transaction has acquired the lock, it must continue to hold the lock until the end of the transaction (commit or abort). This is where the name “two-phase” comes from: the first phase (while the transaction is executing) is when the locks are acquired, and the second phase (at the end of the transaction) is when all the locks are released.
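上述加锁规则可以用一个简化的示意来表达。下面是一个假设性的 Python 草图(LockManager 等名称均为本示例虚构,并非任何真实数据库的实现),展示共享/独占模式的兼容性、锁升级,以及事务结束时第二阶段的统一释放:

The locking rules above can be sketched in simplified form. The following is a hypothetical Python sketch (names like LockManager are invented for this example, not any real database's implementation) showing shared/exclusive compatibility, lock upgrades, and the phase-two release at the end of the transaction:

```python
from enum import Enum

class Mode(Enum):
    SHARED = "shared"
    EXCLUSIVE = "exclusive"

class LockManager:
    """Tracks, per object, which transactions hold the lock and in what mode."""

    def __init__(self):
        self.locks = {}  # object key -> (mode, set of holder transaction ids)

    def acquire(self, txid, key, mode):
        """Try to acquire the lock; return True on success, False if the
        transaction must wait (a real implementation would block or enqueue)."""
        held = self.locks.get(key)
        if held is None:
            self.locks[key] = (mode, {txid})
            return True
        held_mode, holders = held
        if mode is Mode.SHARED and held_mode is Mode.SHARED:
            holders.add(txid)  # shared locks are compatible with each other
            return True
        if holders == {txid}:
            # Lock upgrade: the sole holder may go from shared to exclusive.
            new_mode = Mode.EXCLUSIVE if mode is Mode.EXCLUSIVE else held_mode
            self.locks[key] = (new_mode, {txid})
            return True
        return False  # conflicting lock held by another transaction: must wait

    def release_all(self, txid):
        """Phase two: at commit or abort, release every lock the transaction holds."""
        for key in list(self.locks):
            _mode, holders = self.locks[key]
            holders.discard(txid)
            if not holders:
                del self.locks[key]
```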

由于使用了如此多的锁,很容易发生事务 A 被卡住等待事务 B 释放其锁,反之亦然。这种情况称为死锁。数据库自动检测事务之间的死锁并中止其中一个事务,以便其他事务能够取得进展。应用程序需要重试已中止的事务。

Since so many locks are in use, it can happen quite easily that transaction A is stuck waiting for transaction B to release its lock, and vice versa. This situation is called deadlock. The database automatically detects deadlocks between transactions and aborts one of them so that the others can make progress. The aborted transaction needs to be retried by the application.
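应用程序侧的重试逻辑大致如下。这是一个示意性草图(DeadlockDetected 异常和 run_with_retries 均为本示例假设的名称;真实的数据库驱动会以各自的方式报告死锁中止):

The application-side retry logic looks roughly like this. It is an illustrative sketch (the DeadlockDetected exception and run_with_retries are names assumed for this example; real database drivers report deadlock aborts in their own ways):

```python
import random
import time

class DeadlockDetected(Exception):
    """Raised by a (hypothetical) database driver when the database aborts
    this transaction to break a deadlock."""

def run_with_retries(transaction, max_attempts=5):
    """Run `transaction` (a function performing one transaction's work),
    retrying it whenever the database aborts it as a deadlock victim."""
    for attempt in range(max_attempts):
        try:
            return transaction()
        except DeadlockDetected:
            # Brief randomized backoff so the retried transactions don't
            # immediately deadlock with each other again.
            time.sleep(random.uniform(0, 0.01) * (attempt + 1))
    raise RuntimeError("transaction kept deadlocking; giving up")
```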

两阶段锁定的性能

Performance of two-phase locking

两阶段锁定的一大缺点是性能,这也是它自 20 世纪 70 年代以来并未被普遍采用的原因:两阶段锁定下的事务吞吐量和查询响应时间比弱隔离下差得多。

The big downside of two-phase locking, and the reason why it hasn’t been used by everybody since the 1970s, is performance: transaction throughput and response times of queries are significantly worse under two-phase locking than under weak isolation.

这部分是由于获取和释放所有这些锁的开销,但更重要的是由于并发性的降低。根据设计,如果两个并发事务尝试执行任何可能以任何方式导致竞争条件的操作,则一个事务必须等待另一个事务完成。

This is partly due to the overhead of acquiring and releasing all those locks, but more importantly due to reduced concurrency. By design, if two concurrent transactions try to do anything that may in any way result in a race condition, one has to wait for the other to complete.

传统的关系数据库不限制事务的持续时间,因为它们是为等待人工输入的交互式应用程序而设计的。因此,当一个事务必须等待另一事务时,它可能需要等待的时间没有限制。即使您确保所有事务保持简短,如果多个事务想要访问同一对象,也可能会形成队列,因此一个事务可能必须等待其他几个事务完成才能执行任何操作。

Traditional relational databases don’t limit the duration of a transaction, because they are designed for interactive applications that wait for human input. Consequently, when one transaction has to wait on another, there is no limit on how long it may have to wait. Even if you make sure that you keep all your transactions short, a queue may form if several transactions want to access the same object, so a transaction may have to wait for several others to complete before it can do anything.

因此,运行 2PL 的数据库可能具有相当不稳定的延迟,并且如果工作负载中存在争用,它们在高百分位数时可能会非常慢(请参阅“描述性能” )。可能只需要一个缓慢的事务,或者一个访问大量数据并获取许多锁的事务,就会导致系统的其余部分陷入停顿。当需要稳健运行时,这种不稳定性会产生问题。

For this reason, databases running 2PL can have quite unstable latencies, and they can be very slow at high percentiles (see “Describing Performance”) if there is contention in the workload. It may take just one slow transaction, or one transaction that accesses a lot of data and acquires many locks, to cause the rest of the system to grind to a halt. This instability is problematic when robust operation is required.

尽管基于锁的已提交读隔离级别可能会发生死锁,但在 2PL 可序列化隔离下,死锁发生的频率要高得多(取决于事务的访问模式)。这可能是一个额外的性能问题:当事务由于死锁而中止并重试时,它需要重新完成其工作。如果死锁频繁发生,这可能意味着大量的努力被浪费。

Although deadlocks can happen with the lock-based read committed isolation level, they occur much more frequently under 2PL serializable isolation (depending on the access patterns of your transaction). This can be an additional performance problem: when a transaction is aborted due to deadlock and is retried, it needs to do its work all over again. If deadlocks are frequent, this can mean significant wasted effort.

谓词锁

Predicate locks

在前面对锁的描述中,我们忽略了一个微妙但重要的细节。在 “导致写入偏差的幻象”中,我们讨论了幻象问题,即一个事务改变了另一个事务的搜索查询的结果。具有可序列化隔离的数据库必须防止幻象。

In the preceding description of locks, we glossed over a subtle but important detail. In “Phantoms causing write skew” we discussed the problem of phantoms—that is, one transaction changing the results of another transaction’s search query. A database with serializable isolation must prevent phantoms.

在会议室预订示例中,这意味着如果一个事务在某个时间窗口内搜索了某个房间的现有预订(请参见示例 7-2),则不允许另一个事务同时为同一房间和时间范围插入或更新另一个预订。(可以同时为其他房间插入预订,或者为同一房间在不影响该预订的其他时间插入预订。)

In the meeting room booking example this means that if one transaction has searched for existing bookings for a room within a certain time window (see Example 7-2), another transaction is not allowed to concurrently insert or update another booking for the same room and time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a different time that doesn’t affect the proposed booking.)

我们如何实现这一点?从概念上讲,我们需要一个谓词锁 [ 3 ]。它的工作方式与前面描述的共享/独占锁类似,但它不属于特定对象(例如表中的一行),而是属于与某些搜索条件匹配的所有对象,例如:

How do we implement this? Conceptually, we need a predicate lock [3]. It works similarly to the shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one row in a table), it belongs to all objects that match some search condition, such as:

SELECT * FROM bookings
  WHERE room_id = 123 AND
    end_time   > '2018-01-01 12:00' AND
    start_time < '2018-01-01 13:00';
SELECT * FROM bookings
  WHERE room_id = 123 AND
    end_time   > '2018-01-01 12:00' AND
    start_time < '2018-01-01 13:00';

谓词锁限制访问如下:

A predicate lock restricts access as follows:

  • 如果事务 A 想要读取匹配某个条件的对象(例如在该SELECT查询中),它必须获取查询条件的共享模式谓词锁。如果另一个事务 B 当前对符合这些条件的任何对象拥有独占锁,则 A 必须等到 B 释放其锁后才能进行查询。

  • If transaction A wants to read objects matching some condition, like in that SELECT query, it must acquire a shared-mode predicate lock on the conditions of the query. If another transaction B currently has an exclusive lock on any object matching those conditions, A must wait until B releases its lock before it is allowed to make its query.

  • 如果事务 A 想要插入、更新或删除任何对象,它必须首先检查旧值或新值是否与任何现有谓词锁匹配。如果事务 B 持有匹配的谓词锁,则 A 必须等到 B 提交或中止后才能继续。

  • If transaction A wants to insert, update, or delete any object, it must first check whether either the old or the new value matches any existing predicate lock. If there is a matching predicate lock held by transaction B, then A must wait until B has committed or aborted before it can continue.
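这两条规则的核心是“谓词匹配”检查。下面的 Python 草图(假设性的,仅为说明)针对会议室预订示例实现了这一匹配:一个谓词锁覆盖与查询条件(房间号、且时间区间重叠)匹配的所有现有或未来的预订:

At the heart of these two rules is the predicate-matching check. The Python sketch below (hypothetical, for illustration only) implements that matching for the meeting room example: a predicate lock covers every existing or future booking matching the query's conditions (room number, and an overlapping time range):

```python
class PredicateLock:
    """A shared-mode predicate lock for the booking query: it covers every
    (existing or future) booking for `room_id` overlapping [start, end)."""

    def __init__(self, txid, room_id, start, end):
        self.txid, self.room_id, self.start, self.end = txid, room_id, start, end

    def matches(self, booking):
        """Does this lock cover the given booking (a dict with room_id,
        start_time, end_time)?  Two intervals overlap exactly when each
        one starts before the other ends."""
        return (booking["room_id"] == self.room_id
                and booking["end_time"] > self.start
                and booking["start_time"] < self.end)

def write_must_wait(predicate_locks, writer_txid, booking):
    """A writer must wait if any *other* transaction holds a predicate lock
    matching the value it wants to insert, update, or delete."""
    return any(lock.matches(booking) and lock.txid != writer_txid
               for lock in predicate_locks)
```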

这里的关键思想是谓词锁甚至适用于数据库中尚不存在但将来可能添加的对象(幻影)。如果两阶段锁定包括谓词锁,则数据库可以防止所有形式的写入偏差和其他竞争条件,因此其隔离变得可序列化。

The key idea here is that a predicate lock applies even to objects that do not yet exist in the database, but which might be added in the future (phantoms). If two-phase locking includes predicate locks, the database prevents all forms of write skew and other race conditions, and so its isolation becomes serializable.

索引范围锁定

Index-range locks

不幸的是,谓词锁的性能不佳:如果活动事务有很多锁,则检查匹配的锁会变得非常耗时。因此,大多数具有 2PL 的数据库实际上实现了索引范围锁定(也称为下一个键锁定),这是谓词锁定的简化近似 [ 41 , 50 ]。

Unfortunately, predicate locks do not perform well: if there are many locks by active transactions, checking for matching locks becomes time-consuming. For that reason, most databases with 2PL actually implement index-range locking (also known as next-key locking), which is a simplified approximation of predicate locking [41, 50].

通过使谓词匹配更大的对象集来简化谓词是安全的。例如,如果您对 123 号房间中午到下午 1 点之间的预订持有谓词锁,您可以通过锁定 123 号房间任何时间的预订来近似它,或者通过锁定中午到下午 1 点之间的所有房间(不仅仅是 123 号房间)来近似它。这是安全的,因为任何与原始谓词匹配的写入肯定也会与近似值匹配。

It’s safe to simplify a predicate by making it match a greater set of objects. For example, if you have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by locking bookings for room 123 at any time, or you can approximate it by locking all rooms (not just room 123) between noon and 1 p.m. This is safe, because any write that matches the original predicate will definitely also match the approximations.

在房间预订数据库中,您可能会在 room_id 列上有一个索引,和/或在 start_time 和 end_time 上有索引(否则前面的查询在大型数据库上会非常慢):

In the room bookings database you would probably have an index on the room_id column, and/or indexes on start_time and end_time (otherwise the preceding query would be very slow on a large database):

  • 假设您的索引为 on room_id,数据库使用该索引来查找 123 号房间的现有预订。现在数据库可以简单地将共享锁附加到该索引条目,表明事务已搜索 123 号房间的预订。

  • Say your index is on room_id, and the database uses this index to find existing bookings for room 123. Now the database can simply attach a shared lock to this index entry, indicating that a transaction has searched for bookings of room 123.

  • 或者,如果数据库使用基于时间的索引来查找现有预订,它可以将共享锁附加到该索引中的一个值范围,指示事务已搜索与 2018 年 1 月 1 日中午到下午 1 点时间段重叠的预订。

  • Alternatively, if the database uses a time-based index to find existing bookings, it can attach a shared lock to a range of values in that index, indicating that a transaction has searched for bookings that overlap with the time period of noon to 1 p.m. on January 1, 2018.

无论哪种方式,搜索条件的近似值都会附加到其中一个索引。现在,如果另一个事务想要插入、更新或删除同一房间和/或重叠时间段的预订,则必须更新索引的同一部分。在这样做的过程中,它会遇到共享锁,它会被迫等待,直到锁被释放。

Either way, an approximation of the search condition is attached to one of the indexes. Now, if another transaction wants to insert, update, or delete a booking for the same room and/or an overlapping time period, it will have to update the same part of the index. In the process of doing so, it will encounter the shared lock, and it will be forced to wait until the lock is released.
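作为示意,下面的草图(同样是为本示例虚构的,并非真实实现)将谓词“123 号房间中午到下午 1 点”近似为对 room_id 索引条目的锁:锁的范围更粗,但任何与原始谓词匹配的写入都必然会触碰同一个索引条目:

As an illustration, the sketch below (again invented for this example, not a real implementation) approximates the predicate "room 123 between noon and 1 p.m." as a lock on the room_id index entry: the lock is coarser, but any write matching the original predicate necessarily touches the same index entry:

```python
def index_range_lock_key(room_id):
    """Approximate the predicate "room `room_id` between noon and 1 p.m."
    by locking the room_id index entry for that room at *any* time.
    Coarser than the predicate, but every write matching the predicate
    also touches this index entry, so no conflict is ever missed."""
    return ("idx_room_id", room_id)

def writer_blocked(shared_index_locks, booking):
    # A writer has to update the room_id index entry for its booking,
    # so it runs into any shared lock attached to that entry.
    return index_range_lock_key(booking["room_id"]) in shared_index_locks
```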

这可以有效防止幻像和写入偏差。索引范围锁不如谓词锁那么精确(它们可能会锁定比维护可串行性严格所需范围更大的对象范围),但由于它们的开销要低得多,因此它们是一个很好的折衷方案。

This provides effective protection against phantoms and write skew. Index-range locks are not as precise as predicate locks would be (they may lock a bigger range of objects than is strictly necessary to maintain serializability), but since they have much lower overheads, they are a good compromise.

如果没有合适的索引可以附加范围锁,则数据库可以回退到整个表上的共享锁。这对性能不利,因为它将阻止所有其他事务写入表,但这是一个安全的后备位置。

If there is no suitable index where a range lock can be attached, the database can fall back to a shared lock on the entire table. This will not be good for performance, since it will stop all other transactions writing to the table, but it’s a safe fallback position.

可串行快照隔离 (SSI)

Serializable Snapshot Isolation (SSI)

本章描绘了数据库并发控制的一幅惨淡图景。一方面,我们有性能不佳(两阶段锁定)或扩展性不佳(串行执行)的可串行化实现。另一方面,我们有性能良好但容易出现各种竞争条件(丢失更新、写入偏差、幻读等)的弱隔离级别。可序列化隔离和良好的性能从根本上是相互矛盾的吗?

This chapter has painted a bleak picture of concurrency control in databases. On the one hand, we have implementations of serializability that don’t perform well (two-phase locking) or don’t scale well (serial execution). On the other hand, we have weak isolation levels that have good performance, but are prone to various race conditions (lost updates, write skew, phantoms, etc.). Are serializable isolation and good performance fundamentally at odds with each other?

也许不是:一种称为可序列化快照隔离(SSI)的算法非常有前途。它提供完全的可串行化,但与快照隔离相比,性能损失很小。SSI 相当新:它于 2008 年首次被描述[ 40 ],并且是 Michael Cahill 博士论文的主题[ 51 ]。

Perhaps not: an algorithm called serializable snapshot isolation (SSI) is very promising. It provides full serializability, but has only a small performance penalty compared to snapshot isolation. SSI is fairly new: it was first described in 2008 [40] and is the subject of Michael Cahill’s PhD thesis [51].

如今,SSI 既用于单节点数据库(PostgreSQL 自版本 9.1 [ 41 ] 起的可序列化隔离级别),也用于分布式数据库(FoundationDB 使用类似的算法)。由于与其他并发控制机制相比,SSI 还很年轻,它仍在实践中证明其性能,但它有可能足够快,成为未来新的默认机制。

Today SSI is used both in single-node databases (the serializable isolation level in PostgreSQL since version 9.1 [41]) and distributed databases (FoundationDB uses a similar algorithm). As SSI is so young compared to other concurrency control mechanisms, it is still proving its performance in practice, but it has the possibility of being fast enough to become the new default in the future.

悲观与乐观并发控制

Pessimistic versus optimistic concurrency control

两阶段锁定是一种所谓的悲观并发控制机制:它基于这样的原则:如果任何事情可能出错(如另一个事务持有的锁所表明的那样),最好等到情况再次安全后再做任何事情。它类似于多线程编程中用于保护数据结构的互斥锁。

Two-phase locking is a so-called pessimistic concurrency control mechanism: it is based on the principle that if anything might possibly go wrong (as indicated by a lock held by another transaction), it’s better to wait until the situation is safe again before doing anything. It is like mutual exclusion, which is used to protect data structures in multi-threaded programming.

从某种意义上说,串行执行是悲观到了极点:它本质上相当于每个事务在事务持续时间内对整个数据库(或数据库的一个分区)拥有排他锁。我们通过使每个事务执行速度非常快来弥补悲观情绪,因此它只需要保持“锁”很短的时间。

Serial execution is, in a sense, pessimistic to the extreme: it is essentially equivalent to each transaction having an exclusive lock on the entire database (or one partition of the database) for the duration of the transaction. We compensate for the pessimism by making each transaction very fast to execute, so it only needs to hold the “lock” for a short time.

相比之下,可序列化快照隔离是一种乐观并发控制技术。在这种情况下,乐观意味着当潜在危险的情况发生时,事务不会阻塞,而是继续执行,希望最终一切都会好起来。当一个事务想要提交时,数据库会检查是否发生了任何不好的事情(即是否违反了隔离);如果是,则该事务将中止并且必须重试。只有以可串行化方式执行的事务才被允许提交。

By contrast, serializable snapshot isolation is an optimistic concurrency control technique. Optimistic in this context means that instead of blocking if something potentially dangerous happens, transactions continue anyway, in the hope that everything will turn out all right. When a transaction wants to commit, the database checks whether anything bad happened (i.e., whether isolation was violated); if so, the transaction is aborted and has to be retried. Only transactions that executed serializably are allowed to commit.

乐观并发控制是一个古老的想法[ 52 ],它的优点和缺点已经争论了很长时间[ 53 ]。如果存在高争用(许多事务试图访问相同的对象),它的性能就会很差,因为这会导致很大比例的事务需要中止。如果系统已经接近其最大吞吐量,则重试事务带来的额外事务负载可能会使性能变差。

Optimistic concurrency control is an old idea [52], and its advantages and disadvantages have been debated for a long time [53]. It performs badly if there is high contention (many transactions trying to access the same objects), as this leads to a high proportion of transactions needing to abort. If the system is already close to its maximum throughput, the additional transaction load from retried transactions can make performance worse.

但是,如果有足够的备用容量,并且事务之间的争用不太高,则乐观并发控制技术往往比悲观并发控制技术表现更好。可以通过可交换的原子操作来减少争用:例如,如果多个事务同时想要递增计数器,则增量的应用顺序并不重要(只要不在同一事务中读取计数器),因此并发增量都可以应用而不会发生冲突。

However, if there is enough spare capacity, and if contention between transactions is not too high, optimistic concurrency control techniques tend to perform better than pessimistic ones. Contention can be reduced with commutative atomic operations: for example, if several transactions concurrently want to increment a counter, it doesn’t matter in which order the increments are applied (as long as the counter isn’t read in the same transaction), so the concurrent increments can all be applied without conflicting.
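加法的可交换性正是这里的关键:无论并发增量以何种顺序应用,计数器的最终值都相同,因此它们无需相互冲突。下面的小例子验证了这一点(纯粹为说明而写):

The commutativity of addition is exactly the point here: no matter in which order the concurrent increments are applied, the final counter value is the same, so they need not conflict with one another. The small example below checks this (written purely for illustration):

```python
import itertools

def apply_increments(initial, increments):
    """Apply a batch of atomic increments in the given order. Because
    addition is commutative, every ordering yields the same final value,
    which is why concurrent increments can be applied without conflict."""
    value = initial
    for inc in increments:
        value += inc
    return value

# Every interleaving of three concurrent increments gives the same result.
results = {apply_increments(0, order) for order in itertools.permutations([1, 5, 10])}
assert results == {16}
```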

顾名思义,SSI 基于快照隔离,即事务中的所有读取均来自数据库的一致快照(请参阅“快照隔离和可重复读取”)。这是与早期乐观并发控制技术相比的主要区别。除了快照隔离之外,SSI 还添加了一种算法,用于检测写入之间的序列化冲突并确定要中止哪些事务。

As the name suggests, SSI is based on snapshot isolation—that is, all reads within a transaction are made from a consistent snapshot of the database (see “Snapshot Isolation and Repeatable Read”). This is the main difference compared to earlier optimistic concurrency control techniques. On top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among writes and determining which transactions to abort.

基于过时前提的决策

Decisions based on an outdated premise

当我们之前讨论快照隔离中的写入偏差时(请参阅“写入偏差和幻像”),我们观察到一种重复出现的模式:事务从数据库读取一些数据,检查查询结果,并决定采取一些操作(写入数据库)基于它看到的结果。但是,在快照隔离下,在事务提交时,原始查询的结果可能不再是最新的,因为数据可能在此期间已被修改。

When we previously discussed write skew in snapshot isolation (see “Write Skew and Phantoms”), we observed a recurring pattern: a transaction reads some data from the database, examines the result of the query, and decides to take some action (write to the database) based on the result that it saw. However, under snapshot isolation, the result from the original query may no longer be up-to-date by the time the transaction commits, because the data may have been modified in the meantime.

换句话说,事务是基于一个前提(事务开始时为真的事实,例如“目前有两名医生待命”)采取行动的。后来,当事务想要提交时,原始数据可能已经改变,前提可能不再成立。

Put another way, the transaction is taking an action based on a premise (a fact that was true at the beginning of the transaction, e.g., “There are currently two doctors on call”). Later, when the transaction wants to commit, the original data may have changed—the premise may no longer be true.

当应用程序进行查询时(例如,“当前有多少医生待命?”),数据库不知道应用程序逻辑如何使用该查询的结果。为了安全起见,数据库需要假设查询结果的任何变化(前提)都意味着该事务中的写入可能无效。换句话说,事务中的查询和写入之间可能存在因果依赖性。为了提供可序列化的隔离,数据库必须检测事务可能在过时的前提下执行的情况,并在这种情况下中止事务。

When the application makes a query (e.g., “How many doctors are currently on call?”), the database doesn’t know how the application logic uses the result of that query. To be safe, the database needs to assume that any change in the query result (the premise) means that writes in that transaction may be invalid. In other words, there may be a causal dependency between the queries and the writes in the transaction. In order to provide serializable isolation, the database must detect situations in which a transaction may have acted on an outdated premise and abort the transaction in that case.

数据库如何知道查询结果是否已更改?有两种情况需要考虑:

How does the database know if a query result might have changed? There are two cases to consider:

  • 检测对过时 MVCC 对象版本的读取(在读取之前发生未提交的写入)

  • Detecting reads of a stale MVCC object version (uncommitted write occurred before the read)

  • 检测影响先前读取的写入(写入发生在读取之后)

  • Detecting writes that affect prior reads (the write occurs after the read)

检测过时的 MVCC 读取

Detecting stale MVCC reads

回想一下,快照隔离通常是通过多版本并发控制(MVCC;见图7-10)来实现的。当事务从 MVCC 数据库中的一致快照读取时,它会忽略在拍摄快照时尚未提交的任何其他事务所做的写入。在图 7-10中,事务 43 将 Alice 视为具有on_call = true,因为事务 42(修改了 Alice 的待命状态)未提交。然而,当事务 43 想要提交时,事务 42 已经提交。这意味着从一致性快照读取时被忽略的写入现在已经生效,事务 43 的前提不再成立。

Recall that snapshot isolation is usually implemented by multi-version concurrency control (MVCC; see Figure 7-10). When a transaction reads from a consistent snapshot in an MVCC database, it ignores writes that were made by any other transactions that hadn’t yet committed at the time when the snapshot was taken. In Figure 7-10, transaction 43 sees Alice as having on_call = true, because transaction 42 (which modified Alice’s on-call status) is uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already committed. This means that the write that was ignored when reading from the consistent snapshot has now taken effect, and transaction 43’s premise is no longer true.

图 7-10。检测事务何时从 MVCC 快照读取过时的值。

Figure 7-10. Detecting when a transaction reads outdated values from an MVCC snapshot.

为了防止这种异常情况,数据库需要跟踪一个事务何时由于 MVCC 可见性规则而忽略另一个事务的写入。当事务想要提交时,数据库会检查是否已提交任何被忽略的写入。如果是这样,则必须中止交易。

In order to prevent this anomaly, the database needs to track when a transaction ignores another transaction’s writes due to MVCC visibility rules. When the transaction wants to commit, the database checks whether any of the ignored writes have now been committed. If so, the transaction must be aborted.

为什么要等到提交时才中止?当检测到过时读取时,为什么不立即中止事务 43?因为如果事务 43 是只读事务,它就不需要中止:不存在写入偏差的风险。在事务 43 进行读取时,数据库还不知道该事务稍后是否会执行写入。此外,在事务 43 提交时,事务 42 可能已经中止,或者仍未提交,因此该读取最终可能并不算过时。通过避免不必要的中止,SSI 保留了快照隔离对从一致快照进行长时间运行读取的支持。

Why wait until committing? Why not abort transaction 43 immediately when the stale read is detected? Well, if transaction 43 was a read-only transaction, it wouldn’t need to be aborted, because there is no risk of write skew. At the time when transaction 43 makes its read, the database doesn’t yet know whether that transaction is going to later perform a write. Moreover, transaction 42 may yet abort or may still be uncommitted at the time when transaction 43 is committed, and so the read may turn out not to have been stale after all. By avoiding unnecessary aborts, SSI preserves snapshot isolation’s support for long-running reads from a consistent snapshot.
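这一检查可以用几行代码勾勒出来。下面是一个假设性草图(SSITransaction 等名称为本示例虚构):事务在从快照读取时记录它忽略了哪些事务的写入,并在提交时检查其中是否有已提交者:

This check can be sketched in a few lines. Below is a hypothetical sketch (names like SSITransaction are invented for this example): while reading from its snapshot, a transaction records whose writes it ignored, and at commit time it checks whether any of them have committed:

```python
class SSITransaction:
    """Minimal sketch of SSI's stale-MVCC-read check. While reading from
    its snapshot, a transaction records which concurrent transactions'
    writes it ignored; at commit time it aborts if any of them committed."""

    def __init__(self, txid):
        self.txid = txid
        self.ignored_writers = set()  # txids whose writes the snapshot hid
        self.performs_writes = False

    def note_ignored_write(self, writer_txid):
        """Called when an MVCC read skips an uncommitted write."""
        self.ignored_writers.add(writer_txid)

    def can_commit(self, committed_txids):
        # A read-only transaction never causes write skew, so it may always
        # commit; otherwise abort if any ignored write has since committed.
        if not self.performs_writes:
            return True
        return not (self.ignored_writers & committed_txids)
```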

检测影响先前读取的写入

Detecting writes that affect prior reads

要考虑的第二种情况是另一个事务在读取数据后修改数据。这种情况如图7-11所示。

The second case to consider is when another transaction modifies data after it has been read. This case is illustrated in Figure 7-11.

图 7-11。在可序列化快照隔离中,检测一个事务何时修改另一个事务的读取。

Figure 7-11. Detecting when one transaction modifies another transaction's reads, in serializable snapshot isolation.

在两阶段锁定的上下文中,我们讨论了索引范围锁(请参阅 “索引范围锁”),它允许数据库锁定对与某些搜索查询匹配的所有行的访问,例如WHERE shift_id = 1234. 我们可以在这里使用类似的技术,只不过 SSI 锁不会阻塞其他事务。

In the context of two-phase locking we discussed index-range locks (see “Index-range locks”), which allow the database to lock access to all rows matching some search query, such as WHERE shift_id = 1234. We can use a similar technique here, except that SSI locks don’t block other transactions.

在图 7-11 中,事务 42 和 43 都搜索 1234 号轮班期间的值班医生。如果 shift_id 上有索引,数据库可以使用索引条目 1234 来记录事务 42 和 43 读取了此数据的事实。(如果没有索引,可以在表级别跟踪此信息。)此信息只需要保留一段时间:在一个事务完成(提交或中止)并且所有并发事务也完成后,数据库就可以忘记它读取了哪些数据。

In Figure 7-11, transactions 42 and 43 both search for on-call doctors during shift 1234. If there is an index on shift_id, the database can use the index entry 1234 to record the fact that transactions 42 and 43 read this data. (If there is no index, this information can be tracked at the table level.) This information only needs to be kept for a while: after a transaction has finished (committed or aborted), and all concurrent transactions have finished, the database can forget what data it read.

当事务写入数据库时,它必须在索引中查找最近读取了受影响数据的任何其他事务。此过程类似于获取受影响的键范围上的写锁,但该锁不会阻塞直到读取者提交,而是充当绊线:它只是通知事务它们读取的数据可能不再是最新的。

When a transaction writes to the database, it must look in the indexes for any other transactions that have recently read the affected data. This process is similar to acquiring a write lock on the affected key range, but rather than blocking until the readers have committed, the lock acts as a tripwire: it simply notifies the transactions that the data they read may no longer be up to date.
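刚才描述的“绊线”机制可以示意如下(ReadTracker 为本示例虚构的名称,并非真实实现):数据库按索引条目记录读取者;随后对该条目的写入不会阻塞,只是给这些读取者做标记,使得冲突双方中后提交的那个必须中止:

The tripwire just described can be sketched as follows (ReadTracker is a name invented for this example, not a real implementation): the database records readers per index entry; a subsequent write to that entry doesn't block, it merely flags those readers, so that whichever of the conflicting transactions commits second must abort:

```python
class ReadTracker:
    """Sketch of the tripwire mechanism. The database remembers, per index
    entry (e.g. shift_id = 1234), which transactions read it. A later write
    to that entry does not block; it flags those readers so that, of two
    conflicting transactions, whichever tries to commit second aborts."""

    def __init__(self):
        self.readers = {}   # index entry -> set of reader txids
        self.flagged = {}   # reader txid -> txids that wrote over its reads

    def record_read(self, txid, index_entry):
        self.readers.setdefault(index_entry, set()).add(txid)

    def record_write(self, writer_txid, index_entry):
        # Tripwire: notify every prior reader, but don't block the writer.
        for reader in self.readers.get(index_entry, set()):
            if reader != writer_txid:
                self.flagged.setdefault(reader, set()).add(writer_txid)

    def try_commit(self, txid, committed):
        """Commit succeeds unless a transaction that overwrote this one's
        reads has already committed; on success, record the commit."""
        if self.flagged.get(txid, set()) & committed:
            return False  # premise outdated by a committed writer: abort
        committed.add(txid)
        return True
```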

图7-11中,事务43通知事务42其先前的读取已过时,反之亦然。事务42最先提交,并且成功:虽然事务43的写影响了42,但43还没有提交,所以写还没有生效。然而,当事务 43 想要提交时,来自 42 的冲突写入已经提交,因此 43 必须中止。

In Figure 7-11, transaction 43 notifies transaction 42 that its prior read is outdated, and vice versa. Transaction 42 is first to commit, and it is successful: although transaction 43’s write affected 42, 43 hasn’t yet committed, so the write has not yet taken effect. However, when transaction 43 wants to commit, the conflicting write from 42 has already been committed, so 43 must abort.

可序列化快照隔离的性能

Performance of serializable snapshot isolation

与往常一样,许多工程细节会影响算法在实践中的运行效果。例如,一种权衡是跟踪事务读取和写入的粒度。如果数据库详细跟踪每个事务的活动,则可以精确地确定哪些事务需要中止,但簿记开销可能会变得很大。不太详细的跟踪速度更快,但可能会导致比严格必要的事务更多的事务被中止。

As always, many engineering details affect how well an algorithm works in practice. For example, one trade-off is the granularity at which transactions’ reads and writes are tracked. If the database keeps track of each transaction’s activity in great detail, it can be precise about which transactions need to abort, but the bookkeeping overhead can become significant. Less detailed tracking is faster, but may lead to more transactions being aborted than strictly necessary.

在某些情况下,一个事务可以读取被另一个事务覆盖的信息:根据发生的其他情况,有时可以证明执行结果仍然是可序列化的。PostgreSQL 使用这个理论来减少不必要的中止次数 [ 11 , 41 ]。

In some cases, it’s okay for a transaction to read information that was overwritten by another transaction: depending on what else happened, it’s sometimes possible to prove that the result of the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of unnecessary aborts [11, 41].

与两阶段锁定相比,可序列化快照隔离的一大优势是一个事务不需要阻塞等待另一事务持有的锁。就像在快照隔离下一样,写入者不会阻止读取者,反之亦然。这种设计原则使查询延迟更加可预测且变化更少。特别是,只读查询可以在一致的快照上运行,而不需要任何锁,这对于读取密集型工作负载非常有吸引力。

Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot isolation, writers don’t block readers, and vice versa. This design principle makes query latency much more predictable and less variable. In particular, read-only queries can run on a consistent snapshot without requiring any locks, which is very appealing for read-heavy workloads.

与串行执行相比,可序列化快照隔离不限于单个CPU核心的吞吐量:FoundationDB将序列化冲突的检测分布在多台机器上,使其能够扩展到非常高的吞吐量。即使数据可能跨多台机器分区,事务也可以在多个分区中读取和写入数据,同时确保可串行隔离[ 54 ]。

Compared to serial execution, serializable snapshot isolation is not limited to the throughput of a single CPU core: FoundationDB distributes the detection of serialization conflicts across multiple machines, allowing it to scale to very high throughput. Even though data may be partitioned across multiple machines, transactions can read and write data in multiple partitions while ensuring serializable isolation [54].

中止率显着影响 SSI 的整体性能。例如,长时间读写数据的事务很可能会遇到冲突并中止,因此SSI要求读写事务相当短(长时间运行的只读事务可能没问题)。然而,与两阶段锁定或串行执行相比,SSI 对慢速事务的敏感度可能较低。

The rate of aborts significantly affects the overall performance of SSI. For example, a transaction that reads and writes data over a long period of time is likely to run into conflicts and abort, so SSI requires that read-write transactions be fairly short (long-running read-only transactions may be okay). However, SSI is probably less sensitive to slow transactions than two-phase locking or serial execution.

总结

Summary

事务是一个抽象层,它允许应用程序假装某些并发问题以及某些类型的硬件和软件故障不存在。一大类错误被简化为简单的事务中止,应用程序只需要重试即可。

Transactions are an abstraction layer that allows an application to pretend that certain concurrency problems and certain kinds of hardware and software faults don’t exist. A large class of errors is reduced down to a simple transaction abort, and the application just needs to try again.

在本章中,我们看到了许多事务可以帮助防止的问题的示例。并非所有应用程序都容易受到所有这些问题的影响:具有非常简单的访问模式(例如仅读取和写入单个记录)的应用程序可能可以在没有事务的情况下运作。但是,对于更复杂的访问模式,事务可以极大地减少您需要考虑的潜在错误情况的数量。

In this chapter we saw many examples of problems that transactions help prevent. Not all applications are susceptible to all those problems: an application with very simple access patterns, such as reading and writing only a single record, can probably manage without transactions. However, for more complex access patterns, transactions can hugely reduce the number of potential error cases you need to think about.

如果没有事务,各种错误场景(进程崩溃、网络中断、断电、磁盘已满、意外并发等)意味着数据可能会以各种方式变得不一致。例如,非规范化数据很容易与源数据不同步。如果没有事务,就很难推断复杂的交互访问对数据库可能产生的影响。

Without transactions, various error scenarios (processes crashing, network interruptions, power outages, disk full, unexpected concurrency, etc.) mean that data can become inconsistent in various ways. For example, denormalized data can easily go out of sync with the source data. Without transactions, it becomes very difficult to reason about the effects that complex interacting accesses can have on the database.

在本章中,我们特别深入探讨了并发控制这一主题。我们讨论了几种广泛使用的隔离级别,特别是已提交读快照隔离 (有时称为可重复读)和可序列化。我们通过讨论竞争条件的各种示例来描述这些隔离级别:

In this chapter, we went particularly deep into the topic of concurrency control. We discussed several widely used isolation levels, in particular read committed, snapshot isolation (sometimes called repeatable read), and serializable. We characterized those isolation levels by discussing various examples of race conditions:

脏读
Dirty reads

一个客户端在提交之前读取另一客户端的写入。读已提交隔离级别和更强的级别可防止脏读。

One client reads another client’s writes before they have been committed. The read committed isolation level and stronger levels prevent dirty reads.

脏写
Dirty writes

一个客户端覆盖另一客户端已写入但尚未提交的数据。几乎所有事务实现都会防止脏写。

One client overwrites data that another client has written, but not yet committed. Almost all transaction implementations prevent dirty writes.

读取倾斜(不可重复读取)
Read skew (nonrepeatable reads)

客户端在不同的时间点看到数据库的不同部分。这个问题通常可以通过快照隔离来防止,快照隔离允许事务在一个时间点从一致的快照中读取数据。通常采用多版本并发控制 (MVCC)来实现。

A client sees different parts of the database at different points in time. This issue is most commonly prevented with snapshot isolation, which allows a transaction to read from a consistent snapshot at one point in time. It is usually implemented with multi-version concurrency control (MVCC).

丢失更新
Lost updates

两个客户端同时执行读取-修改-写入周期。其中一个会覆盖另一个的写入而不合并其更改,因此数据会丢失。快照隔离的某些实现会自动防止这种异常,而其他实现则需要手动锁定 ( SELECT FOR UPDATE)。

Two clients concurrently perform a read-modify-write cycle. One overwrites the other’s write without incorporating its changes, so data is lost. Some implementations of snapshot isolation prevent this anomaly automatically, while others require a manual lock (SELECT FOR UPDATE).

写入倾斜
Write skew

事务读取某些内容,根据它看到的值做出决策,并将决策写入数据库。然而,当写作完成时,决定的前提就不再成立了。只有可序列化的隔离才能防止这种异常情况。

A transaction reads something, makes a decision based on the value it saw, and writes the decision to the database. However, by the time the write is made, the premise of the decision is no longer true. Only serializable isolation prevents this anomaly.

幻读
Phantom reads

事务读取与某些搜索条件匹配的对象。另一个客户端进行的写入会影响该搜索的结果。快照隔离可以防止直接的幻读,但写倾斜上下文中的幻读需要特殊处理,例如索引范围锁。

A transaction reads objects that match some search condition. Another client makes a write that affects the results of that search. Snapshot isolation prevents straightforward phantom reads, but phantoms in the context of write skew require special treatment, such as index-range locks.

弱隔离级别可以防止其中一些异常,但让您(应用程序开发人员)手动处理其他异常(例如,使用显式锁定)。只有可序列化的隔离才能防止所有这些问题。我们讨论了实现可序列化事务的三种不同方法:

Weak isolation levels protect against some of those anomalies but leave you, the application developer, to handle others manually (e.g., using explicit locking). Only serializable isolation protects against all of these issues. We discussed three different approaches to implementing serializable transactions:

从字面上看,按串行顺序执行事务
Literally executing transactions in a serial order

如果您可以使每个事务的执行速度非常快,并且事务吞吐量足够低,可以在单个 CPU 核心上处理,那么这是一个简单而有效的选择。

If you can make each transaction very fast to execute, and the transaction throughput is low enough to process on a single CPU core, this is a simple and effective option.

两阶段锁定
Two-phase locking

几十年来,这一直是实现可串行性的标准方法,但许多应用程序由于其性能特征而避免使用它。

For decades this has been the standard way of implementing serializability, but many applications avoid using it because of its performance characteristics.

可串行快照隔离 (SSI)
Serializable snapshot isolation (SSI)

一种相当新的算法,避免了以前方法的大部分缺点。它使用乐观的方法,允许事务在没有阻塞的情况下继续进行。当事务想要提交时,会对其进行检查,如果执行不可序列化,则会中止。

A fairly new algorithm that avoids most of the downsides of the previous approaches. It uses an optimistic approach, allowing transactions to proceed without blocking. When a transaction wants to commit, it is checked, and it is aborted if the execution was not serializable.

本章中的示例使用关系数据模型。然而,正如 “多对象事务的需求”中所讨论的,无论使用哪种数据模型,事务都是一个有价值的数据库特性。

The examples in this chapter used a relational data model. However, as discussed in “The need for multi-object transactions”, transactions are a valuable database feature, no matter which data model is used.

在本章中,我们主要在单机上运行的数据库的背景下探索思想和算法。分布式数据库中的事务带来了一系列新的困难挑战,我们将在接下来的两章中讨论。

In this chapter, we explored ideas and algorithms mostly in the context of a database running on a single machine. Transactions in distributed databases open a new set of difficult challenges, which we’ll discuss in the next two chapters.

脚注

Joe Hellerstein 指出,Härder 和 Reuter 的论文 [ 7 ]中 ACID 中的 C 是“为了使缩写词发挥作用而加入的”,并且当时并不认为它很重要。

i Joe Hellerstein has remarked that the C in ACID was “tossed in to make the acronym work” in Härder and Reuter’s paper [7], and that it wasn’t considered important at the time.

ii可以说,电子邮件应用程序中一个不正确的计数器并不是特别严重的问题。或者,可以考虑用客户账户余额代替未读计数器,用支付交易代替电子邮件。

ii Arguably, an incorrect counter in an email application is not a particularly critical problem. Alternatively, think of a customer account balance instead of an unread counter, and a payment transaction instead of an email.

iii这并不理想。如果 TCP 连接中断,则必须中止事务。如果中断发生在客户端请求提交之后但在服务器确认提交发生之前,客户端不知道事务是否已提交。为了解决此问题,事务管理器可以通过未绑定到特定 TCP 连接的唯一事务标识符对操作进行分组。我们将在“数据库的端到端争论”中回到这个主题。

iii This is not ideal. If the TCP connection is interrupted, the transaction must be aborted. If the interruption happens after the client has requested a commit but before the server acknowledges that the commit happened, the client doesn’t know whether the transaction was committed or not. To solve this issue, a transaction manager can group operations by a unique transaction identifier that is not bound to a particular TCP connection. We will return to this topic in “The End-to-End Argument for Databases”.

iv严格来说,原子增量这一术语是在多线程编程的意义上使用“原子”一词的。在 ACID 的语境中,它其实应该被称为隔离的或可串行化的增量,但这就有点吹毛求疵了。

iv Strictly speaking, the term atomic increment uses the word atomic in the sense of multi-threaded programming. In the context of ACID, it should actually be called isolated or serializable increment. But that’s getting nitpicky.

v某些数据库支持更弱的隔离级别(称为“未提交读”)。它可以防止脏写,但不能防止脏读。

v Some databases support an even weaker isolation level called read uncommitted. It prevents dirty writes, but does not prevent dirty reads.

vi在撰写本文时,使用锁来实现读已提交隔离的主流数据库只有 IBM DB2 和配置为 read_committed_snapshot=off 的 Microsoft SQL Server [ 23 , 36 ]。

vi At the time of writing, the only mainstream databases that use locks for read committed isolation are IBM DB2 and Microsoft SQL Server in the read_committed_snapshot=off configuration [23, 36].

vii准确地说,事务 ID 是 32 位整数,因此在大约 40 亿次事务后会发生溢出。PostgreSQL 的 vacuum 进程会执行清理工作,确保溢出不会影响数据。

vii To be precise, transaction IDs are 32-bit integers, so they overflow after approximately 4 billion transactions. PostgreSQL’s vacuum process performs cleanup which ensures that overflow does not affect the data.

viii尽管相当复杂,但可以将文本文档的编辑表示为原子突变流。有关一些提示,请参阅 “自动冲突解决”

viii It is possible, albeit fairly complicated, to express the editing of a text document as a stream of atomic mutations. See “Automatic Conflict Resolution” for some pointers.

ix在 PostgreSQL 中,您可以使用范围类型更优雅地完成此操作,但它们在其他数据库中并未得到广泛支持。

ix In PostgreSQL you can do this more elegantly using range types, but they are not widely supported in other databases.

x如果事务需要访问不在内存中的数据,最好的解决方案可能是中止事务,将数据异步读取到内存中,同时继续处理其他事务,然后在数据加载后重新启动事务。这种方法称为反缓存,如前面 “将所有内容保留在内存中”中提到的。

x If a transaction needs to access data that’s not in memory, the best solution may be to abort the transaction, asynchronously fetch the data into memory while continuing to process other transactions, and then restart the transaction when the data has been loaded. This approach is known as anti-caching, as previously mentioned in “Keeping everything in memory”.

xi有时称为 强严格两相锁定(SS2PL),以区别于 2PL 的其他变体。

xi Sometimes called strong strict two-phase locking (SS2PL) to distinguish it from other variants of 2PL.

参考

[ 1 ] Donald D. Chamberlin、Morton M. Astrahan、Michael W. Blasgen 等人:“系统 R 的历史和评估”,Communications of the ACM,第 24 卷,第 10 期,第 632-646 页,1981 年 10 月。doi:10.1145/358769.358784

[1] Donald D. Chamberlin, Morton M. Astrahan, Michael W. Blasgen, et al.: “A History and Evaluation of System R,” Communications of the ACM, volume 24, number 10, pages 632–646, October 1981. doi:10.1145/358769.358784

[ 2 ] Jim N. Gray、Raymond A. Lorie、Gianfranco R. Putzolu 和 Irving L. Traiger:“共享数据库中锁的粒度和一致性程度”,《数据库管理系统建模:会议论文集》 IFIP 数据库管理系统建模工作会议,由 GM Nijssen 编辑,第 364-394 页,Elsevier/North Holland Publishing,1976 年。同时收录于数据库系统读物,第 4 版,由 Joseph M. Hellerstein 和 Michael Stonebraker 编辑,麻省理工学院出版社,2005 年。ISBN:978-0-262-69314-1

[2] Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger: “Granularity of Locks and Degrees of Consistency in a Shared Data Base,” in Modelling in Data Base Management Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems, edited by G. M. Nijssen, pages 364–394, Elsevier/North Holland Publishing, 1976. Also in Readings in Database Systems, 4th edition, edited by Joseph M. Hellerstein and Michael Stonebraker, MIT Press, 2005. ISBN: 978-0-262-69314-1

[ 3 ] Kapali P. Eswaran、Jim N. Gray、Raymond A. Lorie 和 Irving L. Traiger:“数据库系统中的一致性和谓词锁的概念”,ACM 通信,第 19 卷,第 11 期,第 11 页624–633,1976 年 11 月。

[3] Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger: “The Notions of Consistency and Predicate Locks in a Database System,” Communications of the ACM, volume 19, number 11, pages 624–633, November 1976.

[ 4 ]“ ACID 事务非常有用”,FoundationDB, LLC,2013 年。

[4] “ACID Transactions Are Incredibly Helpful,” FoundationDB, LLC, 2013.

[ 5 ] John D. Cook:“数据库事务的 ACID 与 BASE ”,johndcook.com,2009 年 7 月 6 日。

[5] John D. Cook: “ACID Versus BASE for Database Transactions,” johndcook.com, July 6, 2009.

[ 6 ] Gavin Clarke:“ NoSQL 的 CAP 定理克星:我们不会放弃 ACID ”,theregister.co.uk,2012 年 11 月 22 日。

[6] Gavin Clarke: “NoSQL’s CAP Theorem Busters: We Don’t Drop ACID,” theregister.co.uk, November 22, 2012.

[ 7 ] Theo Härder 和 Andreas Reuter:“面向事务的数据库恢复原则”,ACM 计算调查,第 15 卷,第 4 期,第 287-317 页,1983 年 12 月。doi:10.1145/289.291

[7] Theo Härder and Andreas Reuter: “Principles of Transaction-Oriented Database Recovery,” ACM Computing Surveys, volume 15, number 4, pages 287–317, December 1983. doi:10.1145/289.291

[ 8 ] Peter Bailis、Alan Fekete、Ali Ghodsi 等人:“ HAT,而非 CAP:迈向高可用事务”,第 14 届 USENIX 操作系统热门主题研讨会(HotOS),2013 年 5 月。

[8] Peter Bailis, Alan Fekete, Ali Ghodsi, et al.: “HAT, not CAP: Towards Highly Available Transactions,” at 14th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2013.

[ 9 ] Armando Fox、Steven D. Gribble、Yatin Chawathe 等人:“基于集群的可扩展网络服务”, 第 16 届 ACM 操作系统原理研讨会(SOSP),1997 年 10 月。

[9] Armando Fox, Steven D. Gribble, Yatin Chawathe, et al.: “Cluster-Based Scalable Network Services,” at 16th ACM Symposium on Operating Systems Principles (SOSP), October 1997.

[ 10 ] Philip A. Bernstein、Vassos Hadzilacos 和 Nathan Goodman: 数据库系统中的并发控制和恢复。Addison-Wesley,1987 年。ISBN:978-0-201-10715-9,可在Research.microsoft.com上在线获取。

[10] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at research.microsoft.com.

[ 11 ] Alan Fekete、Dimitrios Liarokapis、Elizabeth O'Neil 等人:“使快照隔离可串行化”,ACM Transactions on Database Systems,第 30 卷,第 2 期,第 492–528 页,2005 年 6 月。doi:10.1145/1071610.1071615

[11] Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil, et al.: “Making Snapshot Isolation Serializable,” ACM Transactions on Database Systems, volume 30, number 2, pages 492–528, June 2005. doi:10.1145/1071610.1071615

[ 12 ] Mai Zheng、Joseph Tucek、Feng Qin 和 Mark Lillibridge:“了解电源故障下 SSD 的鲁棒性”,第 11 届 USENIX 文件和存储技术会议(FAST),2013 年 2 月。

[12] Mai Zheng, Joseph Tucek, Feng Qin, and Mark Lillibridge: “Understanding the Robustness of SSDs Under Power Fault,” at 11th USENIX Conference on File and Storage Technologies (FAST), February 2013.

[ 13 ] Laurie Denness:“ SSD:一份礼物和一份诅咒”, laur.ie,2015 年 6 月 2 日。

[13] Laurie Denness: “SSDs: A Gift and a Curse,” laur.ie, June 2, 2015.

[ 14 ] Adam Surak:“当固态硬盘不够坚固时”,blog.algolia.com,2015 年 6 月 15 日。

[14] Adam Surak: “When Solid State Drives Are Not That Solid,” blog.algolia.com, June 15, 2015.

[ 15 ] Thanumalayan Sankaranarayana Pillai、Vijay Chidambaram、Ramnatthan Alagappan 等人:“所有文件系统并非生而平等:论制作崩溃一致应用程序的复杂性”,第 11 届 USENIX 操作系统设计与实现研讨会(OSDI) ,2014 年 10 月。

[15] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, et al.: “All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.

[ 16 ] Chris Siebenmann:“ Unix 的文件持久性问题”,utcc.utoronto.ca,2016 年 4 月 14 日。

[16] Chris Siebenmann: “Unix’s File Durability Problem,” utcc.utoronto.ca, April 14, 2016.

[ 17 ] Lakshmi N. Bairavasundaram、Garth R. Goodson、Bianca Schroeder 等人:“存储堆栈中数据损坏的分析”,第六届 USENIX 文件和存储技术会议(FAST),2008 年 2 月。

[17] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, et al.: “An Analysis of Data Corruption in the Storage Stack,” at 6th USENIX Conference on File and Storage Technologies (FAST), February 2008.

[ 18 ] Bianca Schroeder、Raghav Lagisetty 和 Arif Merchant:“生产中的闪存可靠性:预期和意外”,第 14 届 USENIX 文件和存储技术会议(FAST),2016 年 2 月。

[18] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant: “Flash Reliability in Production: The Expected and the Unexpected,” at 14th USENIX Conference on File and Storage Technologies (FAST), February 2016.

[ 19 ] Don Allison:“ SSD 存储 – 对技术的无知不是借口”,blog.korelogic.com,2015 年 3 月 24 日。

[19] Don Allison: “SSD Storage – Ignorance of Technology Is No Excuse,” blog.korelogic.com, March 24, 2015.

[ 20 ] Dave Scherer:“那些不是事务 (Cassandra 2.0) ”,blog.foundationdb.com,2013 年 9 月 6 日。

[20] Dave Scherer: “Those Are Not Transactions (Cassandra 2.0),” blog.foundationdb.com, September 6, 2013.

[ 21 ] 凯尔·金斯伯里:“ Call Me Maybe: Cassandra ”,aphyr.com,2013 年 9 月 24 日。

[21] Kyle Kingsbury: “Call Me Maybe: Cassandra,” aphyr.com, September 24, 2013.

[ 22 ]“ Aerospike 中的 ACID 支持”,Aerospike, Inc.,2014 年 6 月。

[22] “ACID Support in Aerospike,” Aerospike, Inc., June 2014.

[ 23 ] Martin Kleppmann:“ Hermitage:在 ACID 中测试‘我’ ”,martin.kleppmann.com,2014 年 11 月 25 日。

[23] Martin Kleppmann: “Hermitage: Testing the ‘I’ in ACID,” martin.kleppmann.com, November 25, 2014.

[ 24 ] Tristan D'Agosta:“ BTC 从 Poloniex 被盗”, bitcointalk.org,2014 年 3 月 4 日。

[24] Tristan D’Agosta: “BTC Stolen from Poloniex,” bitcointalk.org, March 4, 2014.

[ 25 ] bitcointhief2:“我如何从交易所偷了大约 100 BTC,以及我如何偷更多!”,reddit.com,2014 年 2 月 2 日。

[25] bitcointhief2: “How I Stole Roughly 100 BTC from an Exchange and How I Could Have Stolen More!,” reddit.com, February 2, 2014.

[ 26 ] Sudhir Jorwekar、Alan Fekete、Krithi Ramamritham 和 S. Sudarshan:“自动检测快照隔离异常”,第 33 届超大型数据库国际会议(VLDB),2007 年 9 月。

[26] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “Automating the Detection of Snapshot Isolation Anomalies,” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.

[ 27 ] Michael Melanson:“交易:隔离的限制”,michaelmelanson.net,2014 年 3 月 20 日。

[27] Michael Melanson: “Transactions: The Limits of Isolation,” michaelmelanson.net, March 20, 2014.

[ 28 ] Hal Berenson、Philip A. Bernstein、Jim N. Gray 等人:“ A Critique of ANSI SQL Isolation Levels ”,ACM 国际数据管理会议(SIGMOD),1995 年 5 月。

[28] Hal Berenson, Philip A. Bernstein, Jim N. Gray, et al.: “A Critique of ANSI SQL Isolation Levels,” at ACM International Conference on Management of Data (SIGMOD), May 1995.

[ 29 ] Atul Adya:“弱一致性:分布式事务的广义理论和乐观实现”,博士论文,麻省理工学院,1999 年 3 月。

[29] Atul Adya: “Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions,” PhD Thesis, Massachusetts Institute of Technology, March 1999.

[ 30 ] Peter Bailis、Aaron Davidson、Alan Fekete 等人:“高可用性事务:优点和局限性(扩展版本) ”,第40 届超大型数据库国际会议 (VLDB),2014 年 9 月。

[30] Peter Bailis, Aaron Davidson, Alan Fekete, et al.: “Highly Available Transactions: Virtues and Limitations (Extended Version),” at 40th International Conference on Very Large Data Bases (VLDB), September 2014.

[ 31 ] Bruce Momjian:“ MVCC Unmasked ”,momjian.us,2014 年 7 月。

[31] Bruce Momjian: “MVCC Unmasked,” momjian.us, July 2014.

[ 32 ] Annamalai Gurusami:“ InnoDB 中的可重复读取隔离级别 – 一致性读取视图的工作原理” , blogs.oracle.com,2013 年 1 月 15 日。

[32] Annamalai Gurusami: “Repeatable Read Isolation Level in InnoDB – How Consistent Read View Works,” blogs.oracle.com, January 15, 2013.

[ 33 ] Nikita Prokopov:“非官方的 Datomic 内部指南”,tonsky.me,2014 年 5 月 6 日。

[33] Nikita Prokopov: “Unofficial Guide to Datomic Internals,” tonsky.me, May 6, 2014.

[ 34 ] Baron Schwartz:“不变性、MVCC 和垃圾收集”,xaprb.com,2013 年 12 月 28 日。

[34] Baron Schwartz: “Immutability, MVCC, and Garbage Collection,” xaprb.com, December 28, 2013.

[ 35 ] J. Chris Anderson、Jan Lehnardt 和 Noah Slater:CouchDB:权威指南。O'Reilly Media,2010 年。ISBN:978-0-596-15589-6

[35] J. Chris Anderson, Jan Lehnardt, and Noah Slater: CouchDB: The Definitive Guide. O’Reilly Media, 2010. ISBN: 978-0-596-15589-6

[ 36 ] Rikdeb Mukherjee:“ DB2 中的隔离(可重复读、读稳定性、游标稳定性、未提交读)及其示例”, mframes.blogspot.co.uk,2013 年 7 月 4 日。

[36] Rikdeb Mukherjee: “Isolation in DB2 (Repeatable Read, Read Stability, Cursor Stability, Uncommitted Read) with Examples,” mframes.blogspot.co.uk, July 4, 2013.

[ 37 ] Steve Hilker:“光标稳定性 (CS) – IBM DB2 社区”,toadworld.com,2013 年 3 月 14 日。

[37] Steve Hilker: “Cursor Stability (CS) – IBM DB2 Community,” toadworld.com, March 14, 2013.

[ 38 ] Nate Wiger:“原子咆哮”,nateware.com,2010 年 2 月 18 日。

[38] Nate Wiger: “An Atomic Rant,” nateware.com, February 18, 2010.

[ 39 ] Joel Jacobson:“ Riak 2.0:数据类型”, blog.joeljacobson.com,2014 年 3 月 23 日。

[39] Joel Jacobson: “Riak 2.0: Data Types,” blog.joeljacobson.com, March 23, 2014.

[ 40 ] Michael J. Cahill、Uwe Röhm 和 Alan Fekete:“快照数据库的可序列化隔离”,ACM 国际数据管理会议(SIGMOD),2008 年 6 月。doi:10.1145/1376616.1376690

[40] Michael J. Cahill, Uwe Röhm, and Alan Fekete: “Serializable Isolation for Snapshot Databases,” at ACM International Conference on Management of Data (SIGMOD), June 2008. doi:10.1145/1376616.1376690

[ 41 ] Dan RK Ports 和 Kevin Grittner:“ PostgreSQL 中的可序列化快照隔离”,第 38 届超大型数据库国际会议(VLDB),2012 年 8 月。

[41] Dan R. K. Ports and Kevin Grittner: “Serializable Snapshot Isolation in PostgreSQL,” at 38th International Conference on Very Large Databases (VLDB), August 2012.

[ 42 ] Tony Andrews:“在 Oracle 中实施复杂约束”,tonyandrews.blogspot.co.uk,2004 年 10 月 15 日。

[42] Tony Andrews: “Enforcing Complex Constraints in Oracle,” tonyandrews.blogspot.co.uk, October 15, 2004.

[ 43 ] Douglas B. Terry、Marvin M. Theimer、Karin Petersen 等人:“管理 Bayou(一个弱连接的复制存储系统)中的更新冲突”,第 15 届 ACM 操作系统原理研讨会(SOSP),1995 年 12 月。doi:10.1145/224056.224070

[43] Douglas B. Terry, Marvin M. Theimer, Karin Petersen, et al.: “Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System,” at 15th ACM Symposium on Operating Systems Principles (SOSP), December 1995. doi:10.1145/224056.224070

[ 44 ] Gary Fredericks:“ Postgres 可串行化错误”,github.com,2015 年 9 月。

[44] Gary Fredericks: “Postgres Serializability Bug,” github.com, September 2015.

[ 45 ] Michael Stonebraker、Samuel Madden、Daniel J. Abadi 等人:“建筑时代的终结(是时候进行彻底重写了) ”,第33 届超大型数据库国际会议(VLDB),2007 年 9 月。

[45] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, et al.: “The End of an Architectural Era (It’s Time for a Complete Rewrite),” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.

[ 46 ] John Hugg:“ H-Store/VoltDB 架构与 CEP 系统和较新的流架构”,Data @Scale Boston,2014 年 11 月。

[46] John Hugg: “H-Store/VoltDB Architecture vs. CEP Systems and Newer Streaming Architectures,” at Data @Scale Boston, November 2014.

[ 47 ] Robert Kallman、Hideaki Kimura、Jonathan Natkins 等人:“ H-Store:高性能分布式主存事务处理系统”,VLDB Endowment 论文集,第 1 卷,第 2 期,第 1496-1499 页,2008 年 8 月。

[47] Robert Kallman, Hideaki Kimura, Jonathan Natkins, et al.: “H-Store: A High-Performance, Distributed Main Memory Transaction Processing System,” Proceedings of the VLDB Endowment, volume 1, number 2, pages 1496–1499, August 2008.

[ 48 ] Rich Hickey:“ Datomic 的架构”,infoq.com,2012 年 11 月 2 日。

[48] Rich Hickey: “The Architecture of Datomic,” infoq.com, November 2, 2012.

[ 49 ] John Hugg:“揭穿有关 VoltDB 内存数据库的神话”,voltdb.com,2014 年 5 月 12 日。

[49] John Hugg: “Debunking Myths About the VoltDB In-Memory Database,” voltdb.com, May 12, 2014.

[ 50 ] Joseph M. Hellerstein、Michael Stonebraker 和 James Hamilton:“数据库系统的架构”,数据库基础与趋势,第 1 卷,第 2 期,第 141-259 页,2007 年 11 月。doi:10.1561/1900000002

[50] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: “Architecture of a Database System,” Foundations and Trends in Databases, volume 1, number 2, pages 141–259, November 2007. doi:10.1561/1900000002

[ 51 ] Michael J. Cahill:“快照数据库的可序列化隔离”,博士论文,悉尼大学,2009 年 7 月。

[51] Michael J. Cahill: “Serializable Isolation for Snapshot Databases,” PhD Thesis, University of Sydney, July 2009.

[ 52 ] DZ Badal:“分布式数据库中并发控制的正确性和影响”,第 3 届国际 IEEE 计算机软件和应用会议(COMPSAC),1979 年 11 月。

[52] D. Z. Badal: “Correctness of Concurrency Control and Implications in Distributed Databases,” at 3rd International IEEE Computer Software and Applications Conference (COMPSAC), November 1979.

[ 53 ] Rakesh Agrawal、Michael J. Carey 和 Miron Livny:“并发控制性能建模:替代方案和影响”,ACM 数据库系统事务(TODS),第 12 卷,第 4 期,第 609-654 页,1987 年 12 月。doi:10.1145/32204.32220

[53] Rakesh Agrawal, Michael J. Carey, and Miron Livny: “Concurrency Control Performance Modeling: Alternatives and Implications,” ACM Transactions on Database Systems (TODS), volume 12, number 4, pages 609–654, December 1987. doi:10.1145/32204.32220

[ 54 ] Dave Rosenthal:“ 14.4MHz 的数据库”, blog.foundationdb.com,2014 年 12 月 10 日。

[54] Dave Rosenthal: “Databases at 14.4MHz,” blog.foundationdb.com, December 10, 2014.

第 8 章分布式系统的问题

Chapter 8. The Trouble with Distributed Systems

嘿,我刚刚认识你

网络很慢

但这是我的数据

所以也许可以存储它

凯尔·金斯伯里,《卡莉·雷·杰普森与网络分区的危险》(2013)

Hey I just met you

The network’s laggy

But here’s my data

So store it maybe

Kyle Kingsbury, Carly Rae Jepsen and the Perils of Network Partitions (2013)

过去几章中反复出现的主题是系统如何处理出现问题的情况。例如,我们讨论了副本故障转移(“处理节点中断”)、复制滞后(“复制滞后问题”)以及事务的并发控制(“弱隔离级别”)。当我们了解实际系统中可能发生的各种边缘情况时,我们就能更好地处理它们。

A recurring theme in the last few chapters has been how systems handle things going wrong. For example, we discussed replica failover (“Handling Node Outages”), replication lag (“Problems with Replication Lag”), and concurrency control for transactions (“Weak Isolation Levels”). As we come to understand various edge cases that can occur in real systems, we get better at handling them.

然而,尽管我们已经谈论了很多故障,但前几章仍然过于乐观。现实更加黑暗。现在,我们将把悲观情绪调到最大,并假设任何可能出错的事情都会出错。i(经验丰富的系统运维人员会告诉你,这是一个合理的假设。如果你客气地询问,他们可能会一边抚摸着过去战斗留下的伤疤,一边给你讲一些可怕的故事。)

However, even though we have talked a lot about faults, the last few chapters have still been too optimistic. The reality is even darker. We will now turn our pessimism to the maximum and assume that anything that can go wrong will go wrong.i (Experienced systems operators will tell you that is a reasonable assumption. If you ask nicely, they might tell you some frightening stories while nursing their scars of past battles.)

使用分布式系统与在单台计算机上编写软件有根本的不同,主要区别在于有很多新的、令人兴奋的方式会导致出错 [ 1 , 2 ]。在本章中,我们将体验实践中出现的问题,并了解我们可以依赖和不能依赖的东西。

Working with distributed systems is fundamentally different from writing software on a single computer—and the main difference is that there are lots of new and exciting ways for things to go wrong [1, 2]. In this chapter, we will get a taste of the problems that arise in practice, and an understanding of the things we can and cannot rely on.

最后,作为工程师,我们的任务是构建能够完成其工作的系统(即满足用户期望的保证),尽管一切都会出错。在第 9 章中,我们将研究一些可以在分布式系统中提供此类保证的算法示例。但首先,在本章中,我们必须了解我们面临的挑战。

In the end, our task as engineers is to build systems that do their job (i.e., meet the guarantees that users are expecting), in spite of everything going wrong. In Chapter 9, we will look at some examples of algorithms that can provide such guarantees in a distributed system. But first, in this chapter, we must understand what challenges we are up against.

本章对分布式系统中可能出现的问题进行了彻底悲观和令人沮丧的概述。我们将研究网络问题(“不可靠的网络”);时钟和计时问题(“不可靠的时钟”);我们将讨论它们在多大程度上是可以避免的。所有这些问题的后果都令人迷失方向,因此我们将探索如何思考分布式系统的状态以及如何推理已经发生的事情(“知识、真相和谎言”)。

This chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system. We will look into problems with networks (“Unreliable Networks”); clocks and timing issues (“Unreliable Clocks”); and we’ll discuss to what degree they are avoidable. The consequences of all these issues are disorienting, so we’ll explore how to think about the state of a distributed system and how to reason about things that have happened (“Knowledge, Truth, and Lies”).

故障和部分故障

Faults and Partial Failures

当您在一台计算机上编写程序时,它通常会以相当可预测的方式运行:要么有效,要么无效。有缺陷的软件可能会让人觉得计算机有时“遇到了糟糕的一天”(这个问题通常可以通过重新启动来解决),但这大多只是软件编写不当的结果。

When you are writing a program on a single computer, it normally behaves in a fairly predictable way: either it works or it doesn’t. Buggy software may give the appearance that the computer is sometimes “having a bad day” (a problem that is often fixed by a reboot), but that is mostly just a consequence of badly written software.

单台计算机上的软件应该不稳定并没有根本原因:当硬件正常工作时,相同的操作总是产生相同的结果(它是确定性的)。如果存在硬件问题(例如,内存损坏或连接器松动),结果通常是整个系统故障(例如,内核恐慌、“蓝屏死机”、无法启动)。拥有良好软件的个人计算机通常要么功能齐全,要么完全损坏,但不会介于两者之间。

There is no fundamental reason why software on a single computer should be flaky: when the hardware is working correctly, the same operation always produces the same result (it is deterministic). If there is a hardware problem (e.g., memory corruption or a loose connector), the consequence is usually a total system failure (e.g., kernel panic, “blue screen of death,” failure to start up). An individual computer with good software is usually either fully functional or entirely broken, but not something in between.

这是计算机设计中经过深思熟虑的选择:如果发生内部故障,我们宁愿计算机彻底崩溃,也不愿返回错误的结果,因为错误的结果处理起来既困难又混乱。因此,计算机隐藏了它们所运行的模糊物理现实,并呈现了一个以数学完美运行的理想化系统模型。CPU 指令总是做同样的事情;如果您将一些数据写入内存或磁盘,该数据将保持完整并且不会随机损坏。这种始终正确计算的设计目标可以追溯到第一台数字计算机[ 3 ]。

This is a deliberate choice in the design of computers: if an internal fault occurs, we prefer a computer to crash completely rather than returning a wrong result, because wrong results are difficult and confusing to deal with. Thus, computers hide the fuzzy physical reality on which they are implemented and present an idealized system model that operates with mathematical perfection. A CPU instruction always does the same thing; if you write some data to memory or disk, that data remains intact and doesn’t get randomly corrupted. This design goal of always-correct computation goes all the way back to the very first digital computer [3].

当您编写在通过网络连接的多台计算机上运行的软件时,情况就完全不同了。在分布式系统中,我们不再在理想化的系统模型中运行——我们别无选择,只能面对物理世界的混乱现实。在物理世界中,可能会出现各种各样的问题,正如以下轶事所示 [ 4 ]:

When you are writing software that runs on several computers, connected by a network, the situation is fundamentally different. In distributed systems, we are no longer operating in an idealized system model—we have no choice but to confront the messy reality of the physical world. And in the physical world, a remarkably wide range of things can go wrong, as illustrated by this anecdote [4]:

在我有限的经验中,我处理过单个数据中心(DC)内的长时间网络分区、PDU(配电单元)故障、交换机故障、整个机架的意外电源重启、整个数据中心主干网故障、整个数据中心停电,还有一名低血糖的司机把他的福特皮卡撞进了某数据中心的 HVAC(供暖、通风和空调)系统。而我甚至都不是运维人员。

科达海尔

In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC), PDU [power distribution unit] failures, switch failures, accidental power cycles of whole racks, whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford pickup truck into a DC’s HVAC [heating, ventilation, and air conditioning] system. And I’m not even an ops guy.

Coda Hale

在分布式系统中,即使系统的其他部分工作正常,也很可能有某些部分以某种不可预测的方式发生故障。这称为部分故障(partial failure)。困难在于,部分故障是不确定性的:如果你尝试做任何涉及多个节点和网络的事情,它有时可能成功,有时则不可预测地失败。正如我们将看到的,你甚至可能不知道某个操作是否成功了,因为消息在网络上传输所需的时间也是不确定的!

In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine. This is known as a partial failure. The difficulty is that partial failures are nondeterministic: if you try to do anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably fail. As we shall see, you may not even know whether something succeeded or not, as the time it takes for a message to travel across a network is also nondeterministic!

这种不确定性和部分失败的可能性使得分布式系统难以使用[ 5 ]。

This nondeterminism and possibility of partial failures is what makes distributed systems hard to work with [5].

云计算和超级计算

Cloud Computing and Supercomputing

关于如何构建大规模计算系统有一系列的哲学:

There is a spectrum of philosophies on how to build large-scale computing systems:

  • 在这个谱系的一端是高性能计算(HPC)领域。拥有数千个 CPU 的超级计算机通常用于计算密集型的科学计算任务,例如天气预报或分子动力学(模拟原子和分子的运动)。

  • At one end of the scale is the field of high-performance computing (HPC). Supercomputers with thousands of CPUs are typically used for computationally intensive scientific computing tasks, such as weather forecasting or molecular dynamics (simulating the movement of atoms and molecules).

  • 另一个极端是云计算。云计算的定义不是很明确 [ 6 ],但通常与多租户数据中心、通过 IP 网络(通常是以太网)连接的商用计算机、弹性/按需资源分配以及计量计费相关。

  • At the other extreme is cloud computing, which is not very well defined [6] but is often associated with multi-tenant datacenters, commodity computers connected with an IP network (often Ethernet), elastic/on-demand resource allocation, and metered billing.

  • 传统的企业数据中心介于这两个极端之间。

  • Traditional enterprise datacenters lie somewhere between these extremes.

这些理念带来了截然不同的故障处理方法。在超级计算机中,作业通常会不时地将其计算状态以检查点的形式保存到持久存储中。如果一个节点发生故障,常见的解决方案是干脆停止整个集群的工作负载。待故障节点修复后,再从最后一个检查点重新开始计算 [ 7,8 ]。因此,超级计算机更像是一台单节点计算机,而不是分布式系统:它通过让部分故障升级为完全故障来处理部分故障——如果系统的任何部分发生故障,就让整个系统崩溃(就像单机上的内核恐慌一样)。

With these philosophies come very different approaches to handling faults. In a supercomputer, a job typically checkpoints the state of its computation to durable storage from time to time. If one node fails, a common solution is to simply stop the entire cluster workload. After the faulty node is repaired, the computation is restarted from the last checkpoint [7, 8]. Thus, a supercomputer is more like a single-node computer than a distributed system: it deals with partial failure by letting it escalate into total failure—if any part of the system fails, just let everything crash (like a kernel panic on a single machine).
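“不时将计算状态保存为检查点、故障后从最后一个检查点重启”的做法大致如下。这是一个假设性的 Python 草图(文件路径和状态结构均为示意),它使用“写临时文件、fsync、再原子重命名”的常见模式,使检查点本身不会因写到一半时崩溃而损坏:

The “checkpoint the computation state from time to time, restart from the last checkpoint after a failure” approach looks roughly as follows. This is a hypothetical Python sketch (the file path and state structure are illustrative), using the common write-temp-file, fsync, then atomically-rename pattern so that a crash mid-write cannot corrupt the checkpoint itself:

```python
import json
import os
import tempfile


def checkpoint(state, path):
    """Durably save the computation state: write to a temp file in the same
    directory, fsync it, then atomically rename over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())         # force the data to durable storage
    os.replace(tmp, path)            # atomic on POSIX: old or new, never half


def restore(path, initial_state):
    """After a crash, resume from the last checkpoint, or start from scratch
    if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return initial_state
```

(为简洁起见,草图省略了对目录本身的 fsync 等细节。For brevity, the sketch omits details such as fsyncing the directory itself.)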

在本书中,我们重点关注用于实现互联网服务的系统,这些系统通常看起来与超级计算机有很大不同:

In this book we focus on systems for implementing internet services, which usually look very different from supercomputers:

  • 许多与互联网相关的应用程序都是在线的,从某种意义上说,它们需要能够随时以低延迟为用户提供服务。使服务不可用(例如,停止集群进行修复)是不可接受的。相比之下,诸如天气模拟之类的离线(批量)作业可以停止并重新启动,而影响相当小。

  • Many internet-related applications are online, in the sense that they need to be able to serve users with low latency at any time. Making the service unavailable—for example, stopping the cluster for repair—is not acceptable. In contrast, offline (batch) jobs like weather simulations can be stopped and restarted with fairly low impact.

  • 超级计算机通常由专用硬件构建,其中每个节点都非常可靠,并且节点通过共享内存和远程直接内存访问(RDMA)进行通信。另一方面,云服务中的节点是由商品机器构建的,由于规模经济,它们可以以较低的成本提供同等的性能,但故障率也较高。

  • Supercomputers are typically built from specialized hardware, where each node is quite reliable, and nodes communicate through shared memory and remote direct memory access (RDMA). On the other hand, nodes in cloud services are built from commodity machines, which can provide equivalent performance at lower cost due to economies of scale, but also have higher failure rates.

  • 大型数据中心网络通常基于 IP 和以太网,以 Clos 拓扑排列以提供高平分带宽 [ 9 ]。超级计算机通常使用专门的网络拓扑,例如多维网格和环面 [ 10 ],这可以为具有已知通信模式的 HPC 工作负载提供更好的性能。

  • Large datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth [9]. Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses [10], which yield better performance for HPC workloads with known communication patterns.

  • 系统越大,其组件之一损坏的可能性就越大。随着时间的推移,损坏的东西会被修复,新的东西也会损坏,但在具有数千个节点的系统中,可以合理地假设某些东西总是损坏的[ 7 ]。当错误处理策略只是放弃时,大型系统最终可能会花费大量时间从故障中恢复,而不是做有用的工作[ 8 ]。

  • The bigger a system gets, the more likely it is that one of its components is broken. Over time, broken things get fixed and new things break, but in a system with thousands of nodes, it is reasonable to assume that something is always broken [7]. When the error handling strategy consists of simply giving up, a large system can end up spending a lot of its time recovering from faults rather than doing useful work [8].

  • 如果系统能够容忍故障节点,并且整体上仍能继续工作,那么这对运维来说是一个非常有用的特性:例如,可以执行滚动升级(参见第 4 章),一次重启一个节点,同时服务继续不间断地为用户提供服务。在云环境中,如果一台虚拟机性能不佳,你可以直接杀掉它并申请一台新的(希望新的那台更快)。

  • If the system can tolerate failed nodes and still keep working as a whole, that is a very useful feature for operations and maintenance: for example, you can perform a rolling upgrade (see Chapter 4), restarting one node at a time, while the service continues serving users without interruption. In cloud environments, if one virtual machine is not performing well, you can just kill it and request a new one (hoping that the new one will be faster).

  • 在地理分布式部署中(使数据在地理位置上靠近用户以减少访问延迟),通信很可能通过互联网进行,与本地网络相比,互联网速度缓慢且不可靠。超级计算机通常假设它们的所有节点都靠近在一起。

  • In a geographically distributed deployment (keeping data geographically close to your users to reduce access latency), communication most likely goes over the internet, which is slow and unreliable compared to local networks. Supercomputers generally assume that all of their nodes are close together.

如果我们想让分布式系统正常工作,我们必须接受部分失败的可能性,并在软件中构建容错机制。换句话说,我们需要用不可靠的组件构建一个可靠的系统。(正如“可靠性”中所讨论的,不存在完美的可靠性,因此我们需要了解我们可以实际承诺的限制。)

If we want to make distributed systems work, we must accept the possibility of partial failure and build fault-tolerance mechanisms into the software. In other words, we need to build a reliable system from unreliable components. (As discussed in “Reliability”, there is no such thing as perfect reliability, so we’ll need to understand the limits of what we can realistically promise.)

即使在仅由几个节点组成的较小系统中,考虑部分故障也很重要。在小型系统中,大多数组件很可能在大多数时间都正常工作。然而,迟早,系统的某些部分出现故障,软件必须以某种方式处理它。故障处理必须是软件设计的一部分,并且您(作为软件的操作员)需要知道在出现故障时软件会出现什么行为。

Even in smaller systems consisting of only a few nodes, it’s important to think about partial failure. In a small system, it’s quite likely that most of the components are working correctly most of the time. However, sooner or later, some part of the system will become faulty, and the software will have to somehow handle it. The fault handling must be part of the software design, and you (as operator of the software) need to know what behavior to expect from the software in the case of a fault.

假设故障很少发生并仅仅希望得到最好的结果是不明智的。重要的是要考虑各种可能的错误(甚至是相当不可能的错误),并在测试环境中人为地创建此类情况以查看会发生什么。在分布式系统中,怀疑、悲观和偏执会带来回报。

It would be unwise to assume that faults are rare and simply hope for the best. It is important to consider a wide range of possible faults—even fairly unlikely ones—and to artificially create such situations in your testing environment to see what happens. In distributed systems, suspicion, pessimism, and paranoia pay off.

不可靠的网络

Unreliable Networks

正如第二部分 的介绍中所讨论的,我们在本书中关注的分布式系统是无共享系统:即通过网络连接的一堆机器。网络是这些机器进行通信的唯一方式——我们假设每台机器都有自己的内存和磁盘,并且一台机器无法访问另一台机器的内存或磁盘(除非通过网络向服务发出请求)。

As discussed in the introduction to Part II, the distributed systems we focus on in this book are shared-nothing systems: i.e., a bunch of machines connected by a network. The network is the only way those machines can communicate—we assume that each machine has its own memory and disk, and one machine cannot access another machine’s memory or disk (except by making requests to a service over the network).

无共享并不是构建系统的唯一方法,但它已成为构建互联网服务的主要方法,原因如下:它相对便宜,因为它不需要特殊的硬件,它可以利用商品化的云计算服务,并且可以通过跨多个地理分布的数据中心的冗余来实现高可靠性。

Shared-nothing is not the only way of building systems, but it has become the dominant approach for building internet services, for several reasons: it’s comparatively cheap because it requires no special hardware, it can make use of commoditized cloud computing services, and it can achieve high reliability through redundancy across multiple geographically distributed datacenters.

互联网和数据中心的大多数内部网络(通常是以太网)都是异步数据包网络。在这种网络中,一个节点可以向另一个节点发送消息(数据包),但网络不保证消息何时到达或是否会到达。如果您发送请求并期望得到响应,则许多情况可能会出错(其中一些如图 8-1所示 ):

The internet and most internal networks in datacenters (often Ethernet) are asynchronous packet networks. In this kind of network, one node can send a message (a packet) to another node, but the network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send a request and expect a response, many things could go wrong (some of which are illustrated in Figure 8-1):

  1. 您的请求可能已丢失(可能是有人拔掉了网线)。

  1. Your request may have been lost (perhaps someone unplugged a network cable).

  2. 您的请求可能正在队列中等待,稍后才会被送达(可能是网络或接收者过载)。

  2. Your request may be waiting in a queue and will be delivered later (perhaps the network or the recipient is overloaded).

  3. 远程节点可能已经发生故障(可能是崩溃了或者断电了)。

  3. The remote node may have failed (perhaps it crashed or it was powered down).

  4. 远程节点可能暂时停止了响应(可能正在经历长时间的垃圾收集暂停;请参阅“进程暂停”),但稍后它会再次开始响应。

  4. The remote node may have temporarily stopped responding (perhaps it is experiencing a long garbage collection pause; see “Process Pauses”), but it will start responding again later.

  5. 远程节点可能已经处理了您的请求,但响应在网络上丢失了(可能是网络交换机配置错误)。

  5. The remote node may have processed your request, but the response has been lost on the network (perhaps a network switch has been misconfigured).

  6. 远程节点可能已经处理了您的请求,但响应被延迟了,稍后才会被送达(可能是网络或您自己的机器过载)。

  6. The remote node may have processed your request, but the response has been delayed and will be delivered later (perhaps the network or your own machine is overloaded).

图 8-1。如果您发送请求但没有得到响应,则无法区分是 (a) 请求丢失、(b) 远程节点已关闭还是 (c) 响应丢失。

发送者甚至无法判断数据包是否已送达:唯一的选择是接收者发送响应消息,而响应消息可能会丢失或延迟。这些问题在异步网络中是无法区分的:您拥有的唯一信息是您尚未收到响应。如果您向另一个节点发送请求但没有收到响应,则无法判断原因。

The sender can’t even tell whether the packet was delivered: the only option is for the recipient to send a response message, which may in turn be lost or delayed. These issues are indistinguishable in an asynchronous network: the only information you have is that you haven’t received a response yet. If you send a request to another node and don’t receive a response, it is impossible to tell why.

处理此问题的常用方法是超时:经过一段时间后,您放弃等待,并假设响应不会到达。然而,当超时发生时,您仍然不知道远程节点是否收到了您的请求(如果请求仍然在某处排队,它仍然可能被送达接收者,即使发送者已经放弃了它)。

The usual way of handling this issue is a timeout: after some time you give up waiting and assume that the response is not going to arrive. However, when a timeout occurs, you still don’t know whether the remote node got your request or not (and if the request is still queued somewhere, it may still be delivered to the recipient, even if the sender has given up on it).
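下面用一个极简的 Python 草图说明这种歧义(flaky_send 和 send_with_retries 是为演示而虚构的辅助函数,并非真实网络代码):调用者在一次超时重试之后虽然只看到一次响应,但请求可能已被远端处理了不止一次。

As a minimal Python sketch of this ambiguity (flaky_send and send_with_retries are hypothetical helpers invented for illustration, not real networking code): after a timeout and a retry, the caller sees only one reply, yet the request may have been processed more than once.

```python
def flaky_send(outcome):
    """Simulate one network round trip with a scripted outcome.

    Returns (reply, was_processed). A real client can only observe the
    reply; whether the remote node processed the request stays hidden.
    """
    if outcome == "request_lost":
        return None, False  # request never arrived: not processed, no reply
    if outcome == "reply_lost":
        return None, True   # processed, but the response was lost
    return "ok", True       # processed, and the reply got through

def send_with_retries(outcomes):
    """Retry (after each simulated timeout) until a reply arrives."""
    times_processed = 0     # ground truth, invisible to a real client
    for outcome in outcomes:
        reply, processed = flaky_send(outcome)
        times_processed += processed
        if reply is not None:
            return reply, times_processed
    return None, times_processed

# First attempt: the reply is lost, so the caller times out and retries;
# the second attempt succeeds. The caller received a single "ok", but the
# remote node has processed the request twice.
reply, times_processed = send_with_retries(["reply_lost", "ok"])
```

这正是为什么重试只有在请求幂等时才是安全的。

This is why retries are only safe when the request is idempotent.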

实践中的网络故障

Network Faults in Practice

几十年来,我们一直在建设计算机网络——人们可能希望现在我们已经找到了如何使它们变得可靠的方法。然而,我们似乎还没有成功。

We have been building computer networks for decades—one might hope that by now we would have figured out how to make them reliable. However, it seems that we have not yet succeeded.

有一些系统性研究和大量轶事证据表明,即使在由一家公司运营的数据中心这样的受控环境中,网络问题也可能出乎意料地常见 [ 14 ]。一项针对中型数据中心的研究发现,每月大约发生 12 次网络故障,其中一半是单台机器断开连接,一半是整个机架断开连接 [ 15 ]。另一项研究测量了架顶交换机、聚合交换机和负载均衡器等组件的故障率 [ 16 ]。研究发现,添加冗余网络设备并不能像您希望的那样减少故障,因为它无法防范人为错误(例如,交换机配置错误),而人为错误是造成中断的主要原因。

There are some systematic studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common, even in controlled environments like a datacenter operated by one company [14]. One study in a medium-sized datacenter found about 12 network faults per month, of which half disconnected a single machine, and half disconnected an entire rack [15]. Another study measured the failure rates of components like top-of-rack switches, aggregation switches, and load balancers [16]. It found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause of outages.

EC2 等公共云服务因频繁出现瞬时网络故障而臭名昭著[ 14 ],而管理良好的私有数据中心网络可以提供更稳定的环境。然而,没有人能免受网络问题的影响:例如,交换机软件升级期间的问题可能会触发网络拓扑重新配置,在此期间网络数据包可能会延迟一分钟以上[17 ]。鲨鱼可能会咬住海底电缆并损坏它们[ 18 ]。其他令人惊讶的故障包括网络接口有时会丢弃所有入站数据包,但会成功发送出站数据包[ 19 ]:仅仅因为网络链路在一个方向上工作并不能保证它也在相反方向上工作。

Public cloud services such as EC2 are notorious for having frequent transient network glitches [14], and well-managed private datacenter networks can be stabler environments. Nevertheless, nobody is immune from network problems: for example, a problem during a software upgrade for a switch could trigger a network topology reconfiguration, during which network packets could be delayed for more than a minute [17]. Sharks might bite undersea cables and damage them [18]. Other surprising faults include a network interface that sometimes drops all inbound packets but sends outbound packets successfully [19]: just because a network link works in one direction doesn’t guarantee it’s also working in the opposite direction.

网络分区

Network partitions

当网络的一部分由于网络故障而与其余部分断开时,这有时称为网络分区(network partition)或网络分裂(netsplit)。在本书中,我们通常会使用更通用的术语网络故障,以避免与存储系统的分区(分片)相混淆,如第 6 章所述。

When one part of the network is cut off from the rest due to a network fault, that is sometimes called a network partition or netsplit. In this book we’ll generally stick with the more general term network fault, to avoid confusion with partitions (shards) of a storage system, as discussed in Chapter 6.

即使网络故障在您的环境中很少见,但可能发生故障的事实意味着您的软件需要能够处理它们。每当通过网络进行任何通信时,都可能会失败——没有办法绕过它。

Even if network faults are rare in your environment, the fact that faults can occur means that your software needs to be able to handle them. Whenever any communication happens over a network, it may fail—there is no way around it.

如果没有定义和测试网络故障的错误处理,则可能会发生任意糟糕的事情:例如,即使网络恢复[ 20 ],集群也可能陷入死锁并永久无法服务请求,甚至可能删除所有您的数据[ 21 ]。如果软件被置于意想不到的情况下,它可能会做出任意意想不到的事情。

If the error handling of network faults is not defined and tested, arbitrarily bad things could happen: for example, the cluster could become deadlocked and permanently unable to serve requests, even when the network recovers [20], or it could even delete all of your data [21]. If software is put in an unanticipated situation, it may do arbitrary unexpected things.

处理网络故障并不一定意味着容忍它们:如果您的网络通常相当可靠,则有效的方法可能是在网络遇到问题时向用户显示错误消息。但是,您确实需要知道您的软件如何应对网络问题并确保系统可以从中恢复。 故意触发网络问题并测试系统的响应可能是有意义的(这是 Chaos Monkey 背后的想法;请参阅“可靠性”)。

Handling network faults doesn’t necessarily mean tolerating them: if your network is normally fairly reliable, a valid approach may be to simply show an error message to users while your network is experiencing problems. However, you do need to know how your software reacts to network problems and ensure that the system can recover from them. It may make sense to deliberately trigger network problems and test the system’s response (this is the idea behind Chaos Monkey; see “Reliability”).

检测故障

Detecting Faults

许多系统需要自动检测故障节点。例如:

Many systems need to automatically detect faulty nodes. For example:

  • 负载均衡器需要停止向已死亡的节点发送请求(即将其移出轮换)。

  • A load balancer needs to stop sending requests to a node that is dead (i.e., take it out of rotation).

  • 在具有单领导者复制的分布式数据库中,如果领导者发生故障,则需要将其中一个追随者提升为新的领导者(请参阅“处理节点中断”)。

  • In a distributed database with single-leader replication, if the leader fails, one of the followers needs to be promoted to be the new leader (see “Handling Node Outages”).

不幸的是,网络的不确定性使得很难判断节点是否正常工作。在某些特定情况下,您可能会收到一些反馈,明确告诉您某些功能不起作用:

Unfortunately, the uncertainty about the network makes it difficult to tell whether a node is working or not. In some specific circumstances you might get some feedback to explicitly tell you that something is not working:

  • 如果您可以到达节点应运行于其上的那台计算机,但没有进程在侦听目标端口(例如,因为进程崩溃了),操作系统将通过回复 RST 或 FIN 数据包来帮助关闭或拒绝 TCP 连接。但是,如果节点在处理您的请求的过程中崩溃,您将无法知道远程节点实际处理了多少数据 [ 22 ]。

  • If you can reach the machine on which the node should be running, but no process is listening on the destination port (e.g., because the process crashed), the operating system will helpfully close or refuse TCP connections by sending a RST or FIN packet in reply. However, if the node crashed while it was handling your request, you have no way of knowing how much data was actually processed by the remote node [22].

  • 如果节点进程崩溃(或被管理员杀死),但该节点的操作系统仍在运行,则脚本可以将崩溃通知其他节点,以便另一个节点可以快速接管,而无需等待超时到期。例如,HBase 就是这样做的[ 23 ]。

  • If a node process crashed (or was killed by an administrator) but the node’s operating system is still running, a script can notify other nodes about the crash so that another node can take over quickly without having to wait for a timeout to expire. For example, HBase does this [23].

  • 如果您有权访问数据中心网络交换机的管理界面,则可以查询它们以检测硬件级别的链路故障(例如,如果远程计算机已关闭)。如果您通过互联网连接,或者您位于无法访问交换机本身的共享数据中心,或者由于网络问题而无法访问管理界面,则排除此选项。

  • If you have access to the management interface of the network switches in your datacenter, you can query them to detect link failures at a hardware level (e.g., if the remote machine is powered down). This option is ruled out if you’re connecting via the internet, or if you’re in a shared datacenter with no access to the switches themselves, or if you can’t reach the management interface due to a network problem.

  • 如果路由器确定您尝试连接的 IP 地址无法访问,它可能会使用 ICMP 目标无法访问数据包回复您。然而,路由器也不具备神奇的故障检测能力——它与网络的其他参与者一样受到相同的限制。

  • If a router is sure that the IP address you’re trying to connect to is unreachable, it may reply to you with an ICMP Destination Unreachable packet. However, the router doesn’t have a magic failure detection capability either—it is subject to the same limitations as other participants of the network.
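上面第一种情况(操作系统以 RST 回复)可以在本机上观察到。下面是一个简短的草图(probe 是为演示而虚构的辅助函数;绑定再关闭端口的做法只是为了找到一个当前无人监听的端口):

The first case above (the OS replying with RST) can be observed on the local machine. Here is a short sketch (probe is a hypothetical helper invented for illustration; the bind-then-close trick just finds a port that currently has no listener):

```python
import socket

def probe(port, timeout=1.0):
    """Try to connect and classify the outcome, as a failure detector might."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(("127.0.0.1", port))
        return "listening"            # some process accepted the connection
    except ConnectionRefusedError:
        return "refused"              # OS sent RST: machine up, process down
    except socket.timeout:
        return "no reply"             # no feedback at all: could be anything
    finally:
        s.close()

# Find a port with no listener: bind to an OS-assigned free port, then
# close it again before probing.
tmp = socket.socket()
tmp.bind(("127.0.0.1", 0))
unused_port = tmp.getsockname()[1]
tmp.close()

result = probe(unused_port)
```

注意,“refused”只说明内核还活着;它并不能告诉你崩溃之前请求被处理了多少。

Note that "refused" only shows the kernel is alive; it says nothing about how much of a request was processed before a crash.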

有关远程节点关闭的快速反馈很有用,但您不能指望它。即使 TCP 确认数据包已传送,应用程序也可能在处理它之前就崩溃了。如果您想确保请求成功,则需要应用程序本身的积极响应[ 24 ]。

Rapid feedback about a remote node being down is useful, but you can’t count on it. Even if TCP acknowledges that a packet was delivered, the application may have crashed before handling it. If you want to be sure that a request was successful, you need a positive response from the application itself [24].

相反,如果出现问题,您可能会在堆栈的某个级别收到错误响应,但通常您必须假设您根本不会收到任何响应。您可以重试几次(TCP 重试是透明的,但您也可以在应用程序级别重试),等待超时结束,如果在超时内没有收到回复,则最终声明节点已死亡。

Conversely, if something has gone wrong, you may get an error response at some level of the stack, but in general you have to assume that you will get no response at all. You can retry a few times (TCP retries transparently, but you may also retry at the application level), wait for a timeout to elapse, and eventually declare the node dead if you don’t hear back within the timeout.

超时和无限延迟

Timeouts and Unbounded Delays

如果超时是检测故障的唯一可靠方法,那么超时应该是多长?不幸的是没有简单的答案。

If a timeout is the only sure way of detecting a fault, then how long should the timeout be? There is unfortunately no simple answer.

长时间超时意味着要等待很长时间才能宣布节点死亡(在此期间,用户可能必须等待或看到错误消息)。较短的超时可以更快地检测到故障,但会带来更高的风险,即错误地宣布节点死亡,而实际上该节点仅遭受暂时的减速(例如,由于节点或网络上的负载峰值)。

A long timeout means a long wait until a node is declared dead (and during this time, users may have to wait or see error messages). A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (e.g., due to a load spike on the node or the network).

过早地宣布节点死亡是有问题的:如果该节点实际上还活着并且正在执行某些操作(例如发送电子邮件),并且另一个节点接管,则该操作最终可能会被执行两次。我们将在 “知识、真理和谎言”以及第 9章和 第11章中更详细地讨论这个问题。

Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of performing some action (for example, sending an email), and another node takes over, the action may end up being performed twice. We will discuss this issue in more detail in “Knowledge, Truth, and Lies”, and in Chapters 9 and 11.

当一个节点被宣告死亡时,其职责需要转移给其他节点,这会给其他节点和网络带来额外的负载。如果系统已经在高负载下挣扎,过早宣布节点死亡会使问题变得更糟。特别是,有可能节点实际上并未死亡,只是由于过载而响应缓慢;将其负载转移到其他节点可能会导致级联故障(在极端情况下,所有节点都宣告彼此死亡,并且一切都停止工作)。

When a node is declared dead, its responsibilities need to be transferred to other nodes, which places additional load on other nodes and the network. If the system is already struggling with high load, declaring nodes dead prematurely can make the problem worse. In particular, it could happen that the node actually wasn’t dead but only slow to respond due to overload; transferring its load to other nodes can cause a cascading failure (in the extreme case, all nodes declare each other dead, and everything stops working).

想象一个虚构的系统,其网络保证数据包的最大延迟:每个数据包要么在某个时间 d 内送达,要么丢失,但送达时间绝不会超过 d。此外,假设您可以保证一个未发生故障的节点总是在某个时间 r 内处理完请求。在这种情况下,您可以保证每个成功的请求都会在 2d + r 时间内收到响应;如果您在这段时间内没有收到响应,您就知道要么网络、要么远程节点无法正常工作。如果这是真的,那么 2d + r 将是一个合理的超时时间。

Imagine a fictitious system with a network that guaranteed a maximum delay for packets—every packet is either delivered within some time d, or it is lost, but delivery never takes longer than d. Furthermore, assume that you can guarantee that a non-failed node always handles a request within some time r. In this case, you could guarantee that every successful request receives a response within time 2d + r—and if you don’t receive a response within that time, you know that either the network or the remote node is not working. If this was true, 2d + r would be a reasonable timeout to use.

不幸的是,我们使用的大多数系统都没有这些保证:异步网络具有无限的延迟(也就是说,它们尝试尽快传送数据包,但数据包到达所需的时间没有上限) ,并且大多数服务器实现不能保证它们可以在某个最大时间内处理请求(请参阅 “响应时间保证”)。对于故障检测,大多数时候系统的速度不够快:如果超时时间很短,则只需往返时间出现短暂的峰值即可使系统失去平衡。

Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks have unbounded delays (that is, they try to deliver packets as quickly as possible, but there is no upper limit on the time it may take for a packet to arrive), and most server implementations cannot guarantee that they can handle requests within some maximum time (see “Response time guarantees”). For failure detection, it’s not sufficient for the system to be fast most of the time: if your timeout is low, it only takes a transient spike in round-trip times to throw the system off-balance.

网络拥塞和排队

Network congestion and queueing

驾驶汽车时,道路网络上行驶时间的变化往往主要是由于交通拥堵。同样,计算机网络上数据包延迟的变化通常也是由于排队造成的 [ 25 ]:

When driving a car, travel times on road networks often vary most due to traffic congestion. Similarly, the variability of packet delays on computer networks is most often due to queueing [25]:

  • 如果多个不同的节点同时尝试将数据包发送到同一目的地,则网络交换机必须将它们排队,并将它们一一送入目的地网络链路(如图 8-2 所示)。在繁忙的网络链路上,数据包可能需要等待一段时间才能获得插槽(这称为网络拥塞)。如果传入数据过多,交换机队列已满,数据包就会被丢弃,因此需要重新发送——即使网络运行良好。

  • If several different nodes simultaneously try to send packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one (as illustrated in Figure 8-2). On a busy network link, a packet may have to wait a while until it can get a slot (this is called network congestion). If there is so much incoming data that the switch queue fills up, the packet is dropped, so it needs to be resent—even though the network is functioning fine.

  • 当数据包到达目标计算机时,如果所有 CPU 核心当前都忙,来自网络的传入请求将由操作系统排队,直到应用程序准备好处理它。根据机器上的负载,这可能需要任意长度的时间。

  • When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming request from the network is queued by the operating system until the application is ready to handle it. Depending on the load on the machine, this may take an arbitrary length of time.

  • 在虚拟化环境中,当另一个虚拟机使用 CPU 核心时,正在运行的操作系统通常会暂停数十毫秒。在此期间,虚拟机无法消耗来自网络的任何数据,因此传入的数据由虚拟机监视器进行排队(缓冲)[ 26 ],进一步增加了网络延迟的可变性。

  • In virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. During this time, the VM cannot consume any data from the network, so the incoming data is queued (buffered) by the virtual machine monitor [26], further increasing the variability of network delays.

  • TCP 执行流量控制(也称为拥塞避免或反压),其中节点限制自己的发送速率,以避免网络链路或接收节点过载 [ 27 ]。这意味着在数据进入网络之前,发送方就需要进行额外的排队。

  • TCP performs flow control (also known as congestion avoidance or backpressure), in which a node limits its own rate of sending in order to avoid overloading a network link or the receiving node [27]. This means additional queueing at the sender before the data even enters the network.

图 8-2。如果多台计算机将网络流量发送到同一目的地,则其交换机队列可能会填满。这里,端口 1、2 和 4 都尝试将数据包发送到端口 3。

此外,如果在某个超时(根据观察到的往返时间计算)内未确认数据包,TCP 会认为数据包丢失,并且丢失的数据包会自动重传。尽管应用程序看不到数据包丢失和重传,但它确实看到了由此产生的延迟(等待超时到期,然后等待重传的数据包得到确认)。

Moreover, TCP considers a packet to be lost if it is not acknowledged within some timeout (which is calculated from observed round-trip times), and lost packets are automatically retransmitted. Although the application does not see the packet loss and retransmission, it does see the resulting delay (waiting for the timeout to expire, and then waiting for the retransmitted packet to be acknowledged).

所有这些因素都会导致网络延迟的变化。当系统接近其最大容量时,排队延迟的范围特别大:具有大量闲置容量的系统很容易耗尽队列,而在利用率较高的系统中,很容易形成长队列。

All of these factors contribute to the variability of network delays. Queueing delays have an especially wide range when a system is close to its maximum capacity: a system with plenty of spare capacity can easily drain queues, whereas in a highly utilized system, long queues can build up very quickly.

在公共云和多租户数据中心中,资源在许多客户之间共享:网络链路和交换机,甚至每台机器的网络接口和CPU(在虚拟机上运行时)都是共享的。MapReduce(请参阅第 10 章)等批处理工作负载很容易使网络链接饱和。由于您无法控制或了解其他客户对共享资源的使用情况,因此如果您附近的某个人(吵闹的邻居)正在使用大量资源,网络延迟可能会发生很大变化 [ 28 , 29 ]。

In public clouds and multi-tenant datacenters, resources are shared among many customers: the network links and switches, and even each machine’s network interface and CPUs (when running on virtual machines), are shared. Batch workloads such as MapReduce (see Chapter 10) can easily saturate network links. As you have no control over or insight into other customers’ usage of the shared resources, network delays can be highly variable if someone near you (a noisy neighbor) is using a lot of resources [28, 29].

在这种环境中,您只能通过实验选择超时:测量一段较长时间内、在许多机器上的网络往返时间的分布,以确定延迟的预期变化。然后,考虑到应用程序的特征,您可以在故障检测延迟和过早超时的风险之间确定适当的权衡。

In such environments, you can only choose timeouts experimentally: measure the distribution of network round-trip times over an extended period, and over many machines, to determine the expected variability of delays. Then, taking into account your application’s characteristics, you can determine an appropriate trade-off between failure detection delay and risk of premature timeouts.
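按照这段的思路,下面的 Python 片段(其中的往返时间数据纯属虚构)先取观测分布的一个较高百分位数,再乘以一个安全系数作为超时:

Following the approach in this paragraph, the Python snippet below (the round-trip times are made up for illustration) takes a high percentile of the observed distribution and multiplies it by a safety factor to get a timeout:

```python
import statistics

# Hypothetical round-trip times in milliseconds, measured over an
# extended period; note the occasional spikes.
rtts_ms = [9.8, 10.1, 10.4, 9.9, 11.2, 10.0, 10.3, 48.0, 9.7, 10.2,
           10.5, 9.6, 10.8, 10.1, 95.0, 10.0, 9.9, 10.7, 10.2, 10.4]

p95_ms = statistics.quantiles(rtts_ms, n=100)[94]  # 95th percentile
timeout_ms = 2 * p95_ms  # headroom: trade detection delay against false alarms
```

安全系数(这里取 2)体现的正是故障检测延迟与过早超时风险之间的权衡。

The safety factor (2 here) embodies exactly the trade-off between failure detection delay and the risk of premature timeouts.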

更好的做法是:系统不使用配置好的恒定超时,而是持续测量响应时间及其变化(抖动),并根据观察到的响应时间分布自动调整超时。这可以通过 Phi Accrual 故障检测器 [ 30 ] 来完成,该检测器被用于例如 Akka 和 Cassandra [ 31 ] 中。TCP 重传超时的工作原理也类似 [ 27 ]。

Even better, rather than using configured constant timeouts, systems can continually measure response times and their variability (jitter), and automatically adjust timeouts according to the observed response time distribution. This can be done with a Phi Accrual failure detector [30], which is used for example in Akka and Cassandra [31]. TCP retransmission timeouts also work similarly [27].
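作为示意,下面是 Phi Accrual 思想的一个极简 Python 草图(真实实现,例如 Akka 和 Cassandra 中的实现,在细节上有所不同;这里的正态分布近似和阈值 8 只是演示用的假设):

As an illustration, here is a minimal Python sketch of the Phi Accrual idea (real implementations, such as those in Akka and Cassandra, differ in the details; the normal-distribution approximation and the threshold of 8 are assumptions for demonstration):

```python
import math
import statistics

class PhiAccrualDetector:
    """Simplified sketch of an accrual failure detector.

    Instead of a boolean alive/dead answer after a fixed timeout, it
    reports a suspicion level phi based on the distribution of observed
    heartbeat intervals; higher phi means the node is more likely down.
    """

    def __init__(self, threshold=8.0):
        self.threshold = threshold
        self.intervals = []
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if len(self.intervals) < 2:
            return 0.0
        mean = statistics.mean(self.intervals)
        std = statistics.stdev(self.intervals) or 1e-3  # guard zero variance
        t = now - self.last_heartbeat
        # P(next heartbeat takes longer than t), normal approximation,
        # clamped to avoid log of zero
        p_later = max(0.5 * math.erfc((t - mean) / (std * math.sqrt(2))), 1e-12)
        return -math.log10(p_later)

    def is_suspect(self, now):
        return self.phi(now) > self.threshold
```

心跳间隔越规律,节点一旦沉默 phi 上升得就越快;有效超时因而随观测到的抖动自适应。

The more regular the heartbeat intervals, the faster phi rises once the node goes silent; the effective timeout thus adapts to the observed jitter.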

同步网络与异步网络

Synchronous Versus Asynchronous Networks

如果我们可以依靠网络以固定的最大延迟传送数据包,并且不丢弃数据包,那么分布式系统会简单得多。为什么我们不能从硬件层面解决这个问题,让网络变得可靠,这样软件就不需要操心呢?

Distributed systems would be a lot simpler if we could rely on the network to deliver packets with some fixed maximum delay, and not to drop packets. Why can’t we solve this at the hardware level and make the network reliable so that the software doesn’t need to worry about it?

为了回答这个问题,将数据中心网络与传统的固定电话网络(非蜂窝、非 VoIP)进行比较是很有趣的,传统的固定电话网络非常可靠:音频帧延迟和掉话的情况非常罕见。电话呼叫需要持续较低的端到端延迟和足够的带宽来传输语音的音频样本。在计算机网络中拥有类似的可靠性和可预测性不是很好吗?

To answer this question, it’s interesting to compare datacenter networks to the traditional fixed-line telephone network (non-cellular, non-VoIP), which is extremely reliable: delayed audio frames and dropped calls are very rare. A phone call requires a constantly low end-to-end latency and enough bandwidth to transfer the audio samples of your voice. Wouldn’t it be nice to have similar reliability and predictability in computer networks?

当您通过电话网络拨打电话时,它会建立一条电路:沿着两个呼叫者之间的整个路线为呼叫分配固定的、有保证的带宽量。该电路保持原位直到呼叫结束[ 32 ]。例如,ISDN 网络以每秒 4,000 帧的固定速率运行。当呼叫建立时,会在每个帧内(每个方向)分配 16 位空间。因此,在通话期间,每一方都保证能够每 250 微秒发送 16 位音频数据 [ 33 , 34 ]。

When you make a call over the telephone network, it establishes a circuit: a fixed, guaranteed amount of bandwidth is allocated for the call, along the entire route between the two callers. This circuit remains in place until the call ends [32]. For example, an ISDN network runs at a fixed rate of 4,000 frames per second. When a call is established, it is allocated 16 bits of space within each frame (in each direction). Thus, for the duration of the call, each side is guaranteed to be able to send exactly 16 bits of audio data every 250 microseconds [33, 34].
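用这段中的数字做个快速核算(除原文给出的数字外,不含任何新信息):每秒 4,000 帧意味着每 250 微秒一帧;每帧每方向 16 位,即每方向 64 kbit/s 的固定带宽。

A quick check of the arithmetic in this paragraph (no information beyond the figures given in the text): 4,000 frames per second means one frame every 250 microseconds, and 16 bits per frame per direction gives a fixed 64 kbit/s per direction.

```python
frames_per_second = 4_000
bits_per_frame_per_direction = 16

frame_interval_us = 1_000_000 / frames_per_second  # 250 microseconds per frame
bandwidth_bit_per_s = frames_per_second * bits_per_frame_per_direction  # 64,000
```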

这种网络是同步的:即使数据经过多个路由器,也不会受到排队的影响,因为用于呼叫的16位空间已经在网络的下一跳中预留。而且由于没有排队,网络的最大端到端延迟是固定的。我们称之为有界延迟

This kind of network is synchronous: even as data passes through several routers, it does not suffer from queueing, because the 16 bits of space for the call have already been reserved in the next hop of the network. And because there is no queueing, the maximum end-to-end latency of the network is fixed. We call this a bounded delay.

我们不能简单地让网络延迟变得可预测吗?

Can we not simply make network delays predictable?

请注意,电话网络中的电路与 TCP 连接有很大不同:电路是固定数量的预留带宽,在电路存在期间没有其他人可以使用这部分带宽;而 TCP 连接的数据包则是机会性地使用任何可用的网络带宽。你可以给 TCP 一个可变大小的数据块(例如,一封电子邮件或一个网页),它会尝试在尽可能短的时间内传输完。当 TCP 连接空闲时,它不使用任何带宽。ii

Note that a circuit in a telephone network is very different from a TCP connection: a circuit is a fixed amount of reserved bandwidth which nobody else can use while the circuit is established, whereas the packets of a TCP connection opportunistically use whatever network bandwidth is available. You can give TCP a variable-sized block of data (e.g., an email or a web page), and it will try to transfer it in the shortest time possible. While a TCP connection is idle, it doesn’t use any bandwidth.ii

如果数据中心网络和互联网是电路交换网络,则可以在建立电路时建立有保证的最大往返时间。但事实并非如此:以太网和 IP 都是数据包交换协议,它们会受到排队的影响,从而导致网络中出现无限制的延迟。这些协议没有电路的概念。

If datacenter networks and the internet were circuit-switched networks, it would be possible to establish a guaranteed maximum round-trip time when a circuit was set up. However, they are not: Ethernet and IP are packet-switched protocols, which suffer from queueing and thus unbounded delays in the network. These protocols do not have the concept of a circuit.

为什么数据中心网络和互联网使用数据包交换?答案是它们针对突发流量进行了优化。电路适用于音频或视频呼叫,在呼叫期间每秒需要传输相当恒定的位数。另一方面,请求网页、发送电子邮件或传输文件没有任何特定的带宽要求 - 我们只是希望它尽快完成。

Why do datacenter networks and the internet use packet switching? The answer is that they are optimized for bursty traffic. A circuit is good for an audio or video call, which needs to transfer a fairly constant number of bits per second for the duration of the call. On the other hand, requesting a web page, sending an email, or transferring a file doesn’t have any particular bandwidth requirement—we just want it to complete as quickly as possible.

如果您想通过电路传输文件,则必须猜测带宽分配。如果您猜测太低,则传输速度会过慢,从而导致网络容量未得到利用。如果猜测太高,则无法建立电路(因为如果不能保证其带宽分配,网络就不允许创建电路)。因此,使用电路进行突发数据传输会浪费网络容量并使传输速度不必要地变慢。相比之下,TCP 会根据可用网络容量动态调整数据传输速率。

If you wanted to transfer a file over a circuit, you would have to guess a bandwidth allocation. If you guess too low, the transfer is unnecessarily slow, leaving network capacity unused. If you guess too high, the circuit cannot be set up (because the network cannot allow a circuit to be created if its bandwidth allocation cannot be guaranteed). Thus, using circuits for bursty data transfers wastes network capacity and makes transfers unnecessarily slow. By contrast, TCP dynamically adapts the rate of data transfer to the available network capacity.

已经有人尝试建立同时支持电路交换和分组交换的混合网络,例如ATM。iii InfiniBand 有一些相似之处[ 35 ]:它在链路层实现端到端流量控制,这减少了网络中排队的需要,尽管它仍然会因链路拥塞而遭受延迟[36 ]。通过仔细使用服务质量(QoS、数据包的优先级和调度)和准入控制(速率限制发送者),可以模拟数据包网络上的电路交换,或提供统计上有界的延迟 [ 25 , 32 ]。

There have been some attempts to build hybrid networks that support both circuit switching and packet switching, such as ATM.iii InfiniBand has some similarities [35]: it implements end-to-end flow control at the link layer, which reduces the need for queueing in the network, although it can still suffer from delays due to link congestion [36]. With careful use of quality of service (QoS, prioritization and scheduling of packets) and admission control (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or provide statistically bounded delay [25, 32].

然而,目前在多租户数据中心和公共云中或通过互联网通信时还无法实现这种服务质量。iv 当前部署的技术不允许我们对网络的延迟或可靠性做出任何保证:我们必须假设网络拥塞、排队和无限延迟将会发生。因此,超时没有“正确”的值——它们需要通过实验来确定。

However, such quality of service is currently not enabled in multi-tenant datacenters and public clouds, or when communicating via the internet.iv Currently deployed technology does not allow us to make any guarantees about delays or reliability of the network: we have to assume that network congestion, queueing, and unbounded delays will happen. Consequently, there’s no “correct” value for timeouts—they need to be determined experimentally.

不可靠的时钟

Unreliable Clocks

时钟和时间很重要。应用程序以各种方式依赖时钟来回答如下问题:

Clocks and time are important. Applications depend on clocks in various ways to answer questions like the following:

  1. 该请求是否已超时?

  1. Has this request timed out yet?

  2. 该服务的第 99 百分位响应时间是多少?

  2. What’s the 99th percentile response time of this service?

  3. 在过去五分钟内,该服务平均每秒处理多少个查询?

  3. How many queries per second did this service handle on average in the last five minutes?

  4. 用户在我们的网站上花费了多长时间?

  4. How long did the user spend on our site?

  5. 这篇文章是什么时候发表的?

  5. When was this article published?

  6. 应在什么日期和时间发送提醒电子邮件?

  6. At what date and time should the reminder email be sent?

  7. 这个缓存条目什么时候过期?

  7. When does this cache entry expire?

  8. 日志文件中此错误消息的时间戳是什么?

  8. What is the timestamp on this error message in the log file?

示例 1-4 测量持续时间(例如,发送请求和接收响应之间的时间间隔),而示例 5-8 描述时间点(在特定日期、特定时间发生的事件)。

Examples 1–4 measure durations (e.g., the time interval between a request being sent and a response being received), whereas examples 5–8 describe points in time (events that occur on a particular date, at a particular time).

在分布式系统中,时间是一件棘手的事情,因为通信不是即时的:消息通过网络从一台机器传输到另一台机器需要时间。接收消息的时间总是晚于发送消息的时间,但由于网络延迟的变化,我们不知道晚了多少时间。这一事实有时使得当涉及多台机器时很难确定事情发生的顺序。

In a distributed system, time is a tricky business, because communication is not instantaneous: it takes time for a message to travel across the network from one machine to another. The time when a message is received is always later than the time when it is sent, but due to variable delays in the network, we don’t know how much later. This fact sometimes makes it difficult to determine the order in which things happened when multiple machines are involved.

而且,网络上的每台机器都有自己的时钟,这是一个实际的硬件设备:通常是石英晶体振荡器。这些设备并不完全准确,因此每台机器都有自己的时间概念,可能比其他机器稍快或稍慢。在某种程度上同步时钟是可能的:最常用的机制是网络时间协议(NTP),它允许计算机时钟根据一组服务器报告的时间进行调整[37 ]。服务器反过来从更准确的时间源(例如 GPS 接收器)获取时间。

Moreover, each machine on the network has its own clock, which is an actual hardware device: usually a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own notion of time, which may be slightly faster or slower than on other machines. It is possible to synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol (NTP), which allows the computer clock to be adjusted according to the time reported by a group of servers [37]. The servers in turn get their time from a more accurate time source, such as a GPS receiver.

单调时钟与时钟

Monotonic Versus Time-of-Day Clocks

现代计算机至少有两种不同类型的时钟:时钟(time-of-day clock)和单调时钟(monotonic clock)。尽管它们都测量时间,但区分两者很重要,因为它们有不同的用途。

Modern computers have at least two different kinds of clocks: a time-of-day clock and a monotonic clock. Although they both measure time, it is important to distinguish the two, since they serve different purposes.

时钟

Time-of-day clocks

时钟的功能符合您对时钟的直观期望:它根据某个日历返回当前日期和时间(也称为挂钟时间)。例如,Linux 上的 clock_gettime(CLOCK_REALTIME)v 和 Java 中的 System.currentTimeMillis() 返回自纪元(即公历 1970 年 1 月 1 日午夜 UTC)以来的秒数(或毫秒数),不计闰秒。有些系统使用其他日期作为参考点。

A time-of-day clock does what you intuitively expect of a clock: it returns the current date and time according to some calendar (also known as wall-clock time). For example, clock_gettime(CLOCK_REALTIME) on Linuxv and System.currentTimeMillis() in Java return the number of seconds (or milliseconds) since the epoch: midnight UTC on January 1, 1970, according to the Gregorian calendar, not counting leap seconds. Some systems use other dates as their reference point.

时钟通常与 NTP 同步,这意味着一台机器上的时间戳(理想情况下)与另一台机器上的时间戳相同。然而,时钟也有各种奇怪之处,如下一节所述。特别是,如果本地时钟比 NTP 服务器超前太远,则可能会被强制重置并看起来跳回到之前的时间点。这些跳跃以及它们经常忽略闰秒的事实使得时钟不适合测量经过的时间[ 38 ]。

Time-of-day clocks are usually synchronized with NTP, which means that a timestamp from one machine (ideally) means the same as a timestamp on another machine. However, time-of-day clocks also have various oddities, as described in the next section. In particular, if the local clock is too far ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in time. These jumps, as well as the fact that they often ignore leap seconds, make time-of-day clocks unsuitable for measuring elapsed time [38].

历史上,时钟也具有相当粗粒度的分辨率,例如,在较旧的 Windows 系统上以 10 毫秒的步长前进[ 39 ]。在最近的系统上,这不是什么问题。

Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward in steps of 10 ms on older Windows systems [39]. On recent systems, this is less of a problem.

单调时钟

Monotonic clocks

单调时钟适合测量持续时间(时间间隔),例如超时或服务的响应时间:例如,Linux 上的 clock_gettime(CLOCK_MONOTONIC) 和 Java 中的 System.nanoTime() 都是单调时钟。这个名字来源于这样一个事实:它们保证总是向前移动(而时钟可能会在时间上向后跳)。

A monotonic clock is suitable for measuring a duration (time interval), such as a timeout or a service’s response time: clock_gettime(CLOCK_MONOTONIC) on Linux and System.nanoTime() in Java are monotonic clocks, for example. The name comes from the fact that they are guaranteed to always move forward (whereas a time-of-day clock may jump back in time).

您可以在某个时间点检查单调时钟的值,执行某些操作,然后在稍后的时间再次检查时钟。两个值之间的差异告诉您两次检查之间经过了多少时间。然而,时钟的绝对值是没有意义的:它可能是自计算机启动以来的纳秒数,或者类似的任意值。特别是,比较来自两台不同计算机的单调时钟值是没有意义的,因为它们并不表示同一事物。

You can check the value of the monotonic clock at one point in time, do something, and then check the clock again at a later time. The difference between the two values tells you how much time elapsed between the two checks. However, the absolute value of the clock is meaningless: it might be the number of nanoseconds since the computer was started, or something similarly arbitrary. In particular, it makes no sense to compare monotonic clock values from two different computers, because they don’t mean the same thing.
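在 Python 中,这两类时钟分别对应 time.time()(时钟)和 time.monotonic()(单调时钟);下面的小例子演示了各自的正确用途:

In Python, the two kinds of clocks correspond to time.time() (time-of-day) and time.monotonic() (monotonic); the small example below shows the appropriate use of each:

```python
import time

# Monotonic clock: use for measuring durations. Its absolute value is
# arbitrary, but differences are meaningful and never negative.
start = time.monotonic()
time.sleep(0.01)
elapsed = time.monotonic() - start

# Time-of-day clock: use for timestamps (seconds since the epoch).
# It may jump if NTP resets it, so avoid it for measuring durations.
timestamp = time.time()
```

切勿比较来自不同机器的 time.monotonic() 值;只有同一台机器上的差值才有意义。

Never compare time.monotonic() values from different machines; only differences taken on the same machine are meaningful.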

在具有多个 CPU 插槽的服务器上,每个 CPU 可能有一个单独的计时器,该计时器不一定与其他 CPU 同步。操作系统会补偿任何差异,并尝试向应用程序线程提供时钟的单调视图,即使它们是在不同的 CPU 上调度的。然而,明智的做法是对这种单调性的保证持保留态度[ 40 ]。

On a server with multiple CPU sockets, there may be a separate timer per CPU, which is not necessarily synchronized with other CPUs. Operating systems compensate for any discrepancy and try to present a monotonic view of the clock to application threads, even as they are scheduled across different CPUs. However, it is wise to take this guarantee of monotonicity with a pinch of salt [40].

如果 NTP 检测到计算机的本地石英钟比 NTP 服务器走得更快或更慢,它可能会调整单调时钟向前走的频率(这称为回转(slewing)时钟)。默认情况下,NTP 允许时钟速率加快或减慢最多 0.05%,但 NTP 不能导致单调时钟向前或向后跳跃。单调时钟的分辨率通常非常好:在大多数系统上,它们可以测量微秒级或更短的时间间隔。

NTP may adjust the frequency at which the monotonic clock moves forward (this is known as slewing the clock) if it detects that the computer’s local quartz is moving faster or slower than the NTP server. By default, NTP allows the clock rate to be speeded up or slowed down by up to 0.05%, but NTP cannot cause the monotonic clock to jump forward or backward. The resolution of monotonic clocks is usually quite good: on most systems they can measure time intervals in microseconds or less.

在分布式系统中,使用单调时钟来测量经过的时间(例如超时)通常很好,因为它不假设不同节点的时钟之间有任何同步,并且对测量的轻微误差不敏感。

In a distributed system, using a monotonic clock for measuring elapsed time (e.g., timeouts) is usually fine, because it doesn’t assume any synchronization between different nodes’ clocks and is not sensitive to slight inaccuracies of measurement.

时钟同步和精度

Clock Synchronization and Accuracy

单调时钟不需要同步,但时钟需要根据 NTP 服务器或其他外部时间源进行设置才能发挥作用。不幸的是,我们让时钟说出正确时间的方法并不像您希望的那样可靠或准确——硬件时钟和 NTP 可能是变化无常的野兽。举几个例子:

Monotonic clocks don’t need synchronization, but time-of-day clocks need to be set according to an NTP server or other external time source in order to be useful. Unfortunately, our methods for getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might hope—hardware clocks and NTP can be fickle beasts. To give just a few examples:

  • 计算机中的石英钟不是很准确:它会漂移(比应有的速度走得更快或更慢)。时钟漂移根据机器的温度而变化。Google 假设其服务器的时钟漂移为 200 ppm(百万分之一)[ 41 ],这相当于每 30 秒与服务器重新同步一次的时钟会漂移 6 毫秒,或者每天重新同步一次的时钟会漂移 17 秒。即使一切正常,这种漂移也会限制您所能达到的最佳精度。

  • The quartz clock in a computer is not very accurate: it drifts (runs faster or slower than it should). Clock drift varies depending on the temperature of the machine. Google assumes a clock drift of 200 ppm (parts per million) for its servers [41], which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30 seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best possible accuracy you can achieve, even if everything is working correctly.

  • If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the local clock will be forcibly reset [37]. Any applications observing the time before and after this reset may see time go backward or suddenly jump forward.

  • If a node is accidentally firewalled off from NTP servers, the misconfiguration may go unnoticed for some time. Anecdotal evidence suggests that this does happen in practice.

  • NTP synchronization can only be as good as the network delay, so there is a limit to its accuracy when you’re on a congested network with variable packet delays. One experiment showed that a minimum error of 35 ms is achievable when synchronizing over the internet [42], though occasional spikes in network delay lead to errors of around a second. Depending on the configuration, large network delays can cause the NTP client to give up entirely.

  • Some NTP servers are wrong or misconfigured, reporting time that is off by hours [43, 44]. NTP clients are quite robust, because they query several servers and ignore outliers. Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you were told by a stranger on the internet.

  • Leap seconds result in a minute that is 59 seconds or 61 seconds long, which messes up timing assumptions in systems that are not designed with leap seconds in mind [45]. The fact that leap seconds have crashed many large systems [38, 46] shows how easy it is for incorrect assumptions about clocks to sneak into a system. The best way of handling leap seconds may be to make NTP servers “lie,” by performing the leap second adjustment gradually over the course of a day (this is known as smearing) [47, 48], although actual NTP server behavior varies in practice [49].

  • In virtual machines, the hardware clock is virtualized, which raises additional challenges for applications that need accurate timekeeping [50]. When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds while another VM is running. From an application’s point of view, this pause manifests itself as the clock suddenly jumping forward [26].

  • If you run software on devices that you don’t fully control (e.g., mobile or embedded devices), you probably cannot trust the device’s hardware clock at all. Some users deliberately set their hardware clock to an incorrect date and time, for example to circumvent timing limitations in games. As a result, the clock might be set to a time wildly in the past or the future.
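
The drift figures quoted in the first bullet follow from simple arithmetic: at 200 ppm, a clock gains or loses up to 200 microseconds per second of elapsed time. A quick sketch to reproduce them:

```java
// Worked check of the clock-drift arithmetic: 200 ppm (parts per million)
// means up to 200 µs of drift per second of elapsed time.
public class DriftMath {
    static final double DRIFT_PPM = 200.0;

    // Maximum drift (in milliseconds) accumulated between resynchronizations.
    static double maxDriftMillis(double secondsBetweenSyncs) {
        // seconds * (ppm * 1e-6) gives seconds of drift; * 1000 converts to ms
        return secondsBetweenSyncs * DRIFT_PPM / 1_000_000.0 * 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(maxDriftMillis(30));     // 6.0 ms for a 30 s sync interval
        System.out.println(maxDriftMillis(86_400)); // 17280.0 ms (~17 s) for daily sync
    }
}
```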

It is possible to achieve very good clock accuracy if you care about it sufficiently to invest significant resources. For example, the MiFID II draft European regulation for financial institutions requires all high-frequency trading funds to synchronize their clocks to within 100 microseconds of UTC, in order to help debug market anomalies such as “flash crashes” and to help detect market manipulation [51].

Such accuracy can be achieved using GPS receivers, the Precision Time Protocol (PTP) [52], and careful deployment and monitoring. However, it requires significant effort and expertise, and there are plenty of ways clock synchronization can go wrong. If your NTP daemon is misconfigured, or a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.

Relying on Synchronized Clocks

The problem with clocks is that while they seem simple and easy to use, they have a surprising number of pitfalls: a day may not have exactly 86,400 seconds, time-of-day clocks may move backward in time, and the time on one node may be quite different from the time on another node.

Earlier in this chapter we discussed networks dropping and arbitrarily delaying packets. Even though networks are well behaved most of the time, software must be designed on the assumption that the network will occasionally be faulty, and the software must handle such faults gracefully. The same is true with clocks: although they work quite well most of the time, robust software needs to be prepared to deal with incorrect clocks.

Part of the problem is that incorrect clocks easily go unnoticed. If a machine’s CPU is defective or its network is misconfigured, it most likely won’t work at all, so it will quickly be noticed and fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most things will seem to work fine, even though its clock gradually drifts further and further away from reality. If some piece of software is relying on an accurately synchronized clock, the result is more likely to be silent and subtle data loss than a dramatic crash [53, 54].

Thus, if you use software that requires synchronized clocks, it is essential that you also carefully monitor the clock offsets between all the machines. Any node whose clock drifts too far from the others should be declared dead and removed from the cluster. Such monitoring ensures that you notice the broken clocks before they can cause too much damage.
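
Such a monitoring job can estimate a remote node's clock offset with a single request/response round trip — the same basic idea NTP itself uses. A minimal sketch (all names, numbers, and the 100 ms threshold are illustrative; the estimate is only good to within about half the round-trip time):

```java
public class ClockOffsetMonitor {
    // Estimate the remote clock's offset from ours, assuming the remote node
    // read its clock roughly halfway through the round trip.
    static double estimateOffsetMillis(long sentAtMillis, long receivedAtMillis,
                                       long remoteTimestampMillis) {
        double assumedRemoteReadingTime = (sentAtMillis + receivedAtMillis) / 2.0;
        return remoteTimestampMillis - assumedRemoteReadingTime;
    }

    public static void main(String[] args) {
        // Request sent at t=1000 ms, reply received at t=1040 ms (local clock);
        // the remote node reported its clock read 1520 ms.
        double offset = estimateOffsetMillis(1000, 1040, 1520);
        System.out.println("estimated offset: " + offset + " ms"); // 500.0
        if (Math.abs(offset) > 100.0) {
            System.out.println("clock drifted too far; declare node dead and remove it");
        }
    }
}
```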

Timestamps for ordering events

Let’s consider one particular situation in which it is tempting, but dangerous, to rely on clocks: ordering of events across multiple nodes. For example, if two clients write to a distributed database, who got there first? Which write is the more recent one?

Figure 8-3 illustrates a dangerous use of time-of-day clocks in a database with multi-leader replication (the example is similar to Figure 5-9). Client A writes x = 1 on node 1; the write is replicated to node 3; client B increments x on node 3 (we now have x = 2); and finally, both writes are replicated to node 2.

Figure 8-3. Client B's write is causally later than client A's write, but B's write has an earlier timestamp.

In Figure 8-3, when a write is replicated to other nodes, it is tagged with a timestamp according to the time-of-day clock on the node where the write originated. The clock synchronization is very good in this example: the skew between node 1 and node 3 is less than 3 ms, which is probably better than you can expect in practice.

Nevertheless, the timestamps in Figure 8-3 fail to order the events correctly: the write x = 1 has a timestamp of 42.004 seconds, but the write x = 2 has a timestamp of 42.003 seconds, even though x = 2 occurred unambiguously later. When node 2 receives these two events, it will incorrectly conclude that x = 1 is the more recent value and drop the write x = 2. In effect, client B’s increment operation will be lost.
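
The scenario in Figure 8-3 can be reduced to a few lines: under LWW, the merge function simply keeps the write with the higher timestamp, so the causally later increment loses. A sketch of the general idea, not any particular database's implementation:

```java
public class LastWriteWins {
    // A replicated value tagged with the (physical-clock) timestamp of its write.
    record TimestampedValue(int value, double timestampSeconds) {}

    // LWW merge: keep whichever write carries the higher timestamp.
    static TimestampedValue merge(TimestampedValue a, TimestampedValue b) {
        return a.timestampSeconds() >= b.timestampSeconds() ? a : b;
    }

    public static void main(String[] args) {
        // Figure 8-3: x = 1 written at 42.004 on node 1; the causally later
        // increment x = 2 written at 42.003 on node 3, whose clock lags slightly.
        TimestampedValue writeA = new TimestampedValue(1, 42.004);
        TimestampedValue writeB = new TimestampedValue(2, 42.003);

        // Node 2 merges the replicated writes: the increment is silently lost.
        System.out.println(merge(writeA, writeB).value()); // 1
    }
}
```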

This conflict resolution strategy is called last write wins (LWW), and it is widely used in both multi-leader replication and leaderless databases such as Cassandra [53] and Riak [54] (see “Last write wins (discarding concurrent writes)”). Some implementations generate timestamps on the client rather than the server, but this doesn’t change the fundamental problems with LWW:

  • Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite values previously written by a node with a fast clock until the clock skew between the nodes has elapsed [54, 55]. This scenario can cause arbitrary amounts of data to be silently dropped without any error being reported to the application.

  • LWW cannot distinguish between writes that occurred sequentially in quick succession (in Figure 8-3, client B’s increment definitely occurs after client A’s write) and writes that were truly concurrent (neither writer was aware of the other). Additional causality tracking mechanisms, such as version vectors, are needed in order to prevent violations of causality (see “Detecting Concurrent Writes”).

  • It is possible for two nodes to independently generate writes with the same timestamp, especially when the clock only has millisecond resolution. An additional tiebreaker value (which can simply be a large random number) is required to resolve such conflicts, but this approach can also lead to violations of causality [53].

Thus, even though it is tempting to resolve conflicts by keeping the most “recent” value and discarding others, it’s important to be aware that the definition of “recent” depends on a local time-of-day clock, which may well be incorrect. Even with tightly NTP-synchronized clocks, you could send a packet at timestamp 100 ms (according to the sender’s clock) and have it arrive at timestamp 99 ms (according to the recipient’s clock)—so it appears as though the packet arrived before it was sent, which is impossible.

Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur? Probably not, because NTP’s synchronization accuracy is itself limited by the network round-trip time, in addition to other sources of error such as quartz drift. For correct ordering, you would need the clock source to be significantly more accurate than the thing you are measuring (namely network delay).

So-called logical clocks [56, 57], which are based on incrementing counters rather than an oscillating quartz crystal, are a safer alternative for ordering events (see “Detecting Concurrent Writes”). Logical clocks do not measure the time of day or the number of seconds elapsed, only the relative ordering of events (whether one event happened before or after another). In contrast, time-of-day and monotonic clocks, which measure actual elapsed time, are also known as physical clocks. We’ll look at ordering a bit more in “Ordering Guarantees”.
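
A minimal Lamport-style logical clock, for illustration: each node keeps a counter, incremented on every local event, and receiving a message pushes the counter past the sender's timestamp, so a causally later event always carries a strictly larger timestamp. This is a sketch of the general idea, not a complete implementation:

```java
public class LamportClock {
    private long counter = 0;

    // A purely local event (e.g., a write) just increments the counter.
    long tick() { return ++counter; }

    // On receiving a message, move the counter past the sender's timestamp,
    // so every causally later event gets a strictly greater timestamp.
    long receive(long senderTimestamp) {
        counter = Math.max(counter, senderTimestamp) + 1;
        return counter;
    }

    public static void main(String[] args) {
        LamportClock nodeA = new LamportClock();
        LamportClock nodeB = new LamportClock();

        long write1 = nodeA.tick();          // client A's write on node A
        long write2 = nodeB.receive(write1); // node B sees A's write, then increments
        System.out.println(write1 + " < " + write2); // causal order is preserved
    }
}
```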

Clock readings have a confidence interval

You may be able to read a machine’s time-of-day clock with microsecond or even nanosecond resolution. But even if you can get such a fine-grained measurement, that doesn’t mean the value is actually accurate to such precision. In fact, it most likely is not—as mentioned previously, the drift in an imprecise quartz clock can easily be several milliseconds, even if you synchronize with an NTP server on the local network every minute. With an NTP server on the public internet, the best possible accuracy is probably to the tens of milliseconds, and the error may easily spike to over 100 ms when there is network congestion [57].

Thus, it doesn’t make sense to think of a clock reading as a point in time—it is more like a range of times, within a confidence interval: for example, a system may be 95% confident that the time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely than that [58]. If we only know the time +/– 100 ms, the microsecond digits in the timestamp are essentially meaningless.

The uncertainty bound can be calculated based on your time source. If you have a GPS receiver or atomic (caesium) clock directly attached to your computer, the expected error range is reported by the manufacturer. If you’re getting the time from a server, the uncertainty is based on the expected quartz drift since your last sync with the server, plus the NTP server’s uncertainty, plus the network round-trip time to the server (to a first approximation, and assuming you trust the server).
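
The first-approximation calculation described above can be sketched as follows; all of the numbers in the example are illustrative:

```java
public class ClockUncertainty {
    // First-approximation uncertainty of an NTP-derived clock reading:
    // expected quartz drift since the last sync, plus the server's own
    // uncertainty, plus the network round-trip time to the server.
    static double uncertaintyMillis(double secondsSinceSync, double driftPpm,
                                    double serverUncertaintyMillis,
                                    double roundTripMillis) {
        // driftPpm µs of drift per second of elapsed time, converted to ms
        double driftMillis = secondsSinceSync * driftPpm / 1000.0;
        return driftMillis + serverUncertaintyMillis + roundTripMillis;
    }

    public static void main(String[] args) {
        // 30 s since last sync at 200 ppm drift (6 ms), 1 ms server uncertainty,
        // 10 ms round trip: the reading is only trustworthy to within ~17 ms.
        System.out.println(uncertaintyMillis(30, 200, 1, 10) + " ms");
    }
}
```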

Unfortunately, most systems don’t expose this uncertainty: for example, when you call clock_gettime(), the return value doesn’t tell you the expected error of the timestamp, so you don’t know if its confidence interval is five milliseconds or five years.

An interesting exception is Google’s TrueTime API in Spanner [41], which explicitly reports the confidence interval on the local clock. When you ask it for the current time, you get back two values: [earliest, latest], which are the earliest possible and the latest possible timestamp. Based on its uncertainty calculations, the clock knows that the actual current time is somewhere within that interval. The width of the interval depends, among other things, on how long it has been since the local quartz clock was last synchronized with a more accurate clock source.

Synchronized clocks for global snapshots

“快照隔离和可重复读取”中,我们讨论了快照隔离,这对于需要支持小型、快速读写事务和大型、长时间运行的只读事务(例如,用于备份或分析)的数据库来说是一个非常有用的功能。 )。它允许只读事务在特定时间点看到数据库处于一致状态,而不会锁定和干扰读写事务。

In “Snapshot Isolation and Repeatable Read” we discussed snapshot isolation, which is a very useful feature in databases that need to support both small, fast read-write transactions and large, long-running read-only transactions (e.g., for backups or analytics). It allows read-only transactions to see the database in a consistent state at a particular point in time, without locking and interfering with read-write transactions.

The most common implementation of snapshot isolation requires a monotonically increasing transaction ID. If a write happened later than the snapshot (i.e., the write has a greater transaction ID than the snapshot), that write is invisible to the snapshot transaction. On a single-node database, a simple counter is sufficient for generating transaction IDs.

However, when a database is distributed across many machines, potentially in multiple datacenters, a global, monotonically increasing transaction ID (across all partitions) is difficult to generate, because it requires coordination. The transaction ID must reflect causality: if transaction B reads a value that was written by transaction A, then B must have a higher transaction ID than A—otherwise, the snapshot would not be consistent. With lots of small, rapid transactions, creating transaction IDs in a distributed system becomes an untenable bottleneck.

Can we use the timestamps from synchronized time-of-day clocks as transaction IDs? If we could get the synchronization good enough, they would have the right properties: later transactions have a higher timestamp. The problem, of course, is the uncertainty about clock accuracy.

Spanner implements snapshot isolation across datacenters in this way [59, 60]. It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the following observation: if you have two confidence intervals, each consisting of an earliest and latest possible timestamp (A = [A_earliest, A_latest] and B = [B_earliest, B_latest]), and those two intervals do not overlap (i.e., A_earliest < A_latest < B_earliest < B_latest), then B definitely happened after A—there can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened.
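
This non-overlap check is easy to express in code. The interval type below is a hypothetical stand-in for a TrueTime-style reading, not the actual API:

```java
public class TrueTimeOrdering {
    // A hypothetical TrueTime-style reading: the clock promises the true time
    // lies somewhere in [earliest, latest].
    record Interval(long earliestMillis, long latestMillis) {}

    // B definitely happened after A only if the two intervals do not overlap.
    static boolean definitelyBefore(Interval a, Interval b) {
        return a.latestMillis() < b.earliestMillis();
    }

    public static void main(String[] args) {
        Interval a = new Interval(100, 107); // ~7 ms wide, like Spanner's bound
        Interval b = new Interval(110, 117);
        Interval c = new Interval(105, 112);
        System.out.println(definitelyBefore(a, b)); // true: intervals disjoint
        System.out.println(definitelyBefore(a, c)); // false: overlap, order unknown
    }
}
```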

In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction. By doing so, it ensures that any transaction that may read the data is at a sufficiently later time, so their confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [41].

Using clock synchronization for distributed transaction semantics is an area of active research [57, 61, 62]. These ideas are interesting, but they have not yet been implemented in mainstream databases outside of Google.

Process Pauses

Let’s consider another example of dangerous clock use in a distributed system. Say you have a database with a single leader per partition. Only the leader is allowed to accept writes. How does a node know that it is still leader (that it hasn’t been declared dead by the others), and that it may safely accept writes?

One option is for the leader to obtain a lease from the other nodes, which is similar to a lock with a timeout [63]. Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that it is the leader for some amount of time, until the lease expires. In order to remain leader, the node must periodically renew the lease before it expires. If the node fails, it stops renewing the lease, so another node can take over when it expires.

You can imagine the request-handling loop looking something like this:

while (true) {
    request = getIncomingRequest();

    // Ensure that the lease always has at least 10 seconds remaining
    if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
        lease = lease.renew();
    }

    if (lease.isValid()) {
        process(request);
    }
}

What’s wrong with this code? Firstly, it’s relying on synchronized clocks: the expiry time on the lease is set by a different machine (where the expiry may be calculated as the current time plus 30 seconds, for example), and it’s being compared to the local system clock. If the clocks are out of sync by more than a few seconds, this code will start doing strange things.

Secondly, even if we change the protocol to only use the local monotonic clock, there is another problem: the code assumes that very little time passes between the point that it checks the time (System.currentTimeMillis()) and the time when the request is processed (process(request)). Normally this code runs very quickly, so the 10 second buffer is more than enough to ensure that the lease doesn’t expire in the middle of processing a request.

However, what if there is an unexpected pause in the execution of the program? For example, imagine the thread stops for 15 seconds around the line lease.isValid() before finally continuing. In that case, it’s likely that the lease will have expired by the time the request is processed, and another node has already taken over as leader. However, there is nothing to tell this thread that it was paused for so long, so this code won’t notice that the lease has expired until the next iteration of the loop—by which time it may have already done something unsafe by processing the request.

Is it crazy to assume that a thread might be paused for so long? Unfortunately not. There are various reasons why this could happen:

  • Many programming language runtimes (such as the Java Virtual Machine) have a garbage collector (GC) that occasionally needs to stop all running threads. These “stop-the-world” GC pauses have sometimes been known to last for several minutes [64]! Even so-called “concurrent” garbage collectors like the HotSpot JVM’s CMS cannot fully run in parallel with the application code—even they need to stop the world from time to time [65]. Although the pauses can often be reduced by changing allocation patterns or tuning GC settings [66], we must assume the worst if we want to offer robust guarantees.

  • In virtualized environments, a virtual machine can be suspended (pausing the execution of all processes and saving the contents of memory to disk) and resumed (restoring the contents of memory and continuing execution). This pause can occur at any time in a process’s execution and can last for an arbitrary length of time. This feature is sometimes used for live migration of virtual machines from one host to another without a reboot, in which case the length of the pause depends on the rate at which processes are writing to memory [67].

  • On end-user devices such as laptops, execution may also be suspended and resumed arbitrarily, e.g., when the user closes the lid of their laptop.

  • When the operating system context-switches to another thread, or when the hypervisor switches to a different virtual machine (when running in a virtual machine), the currently running thread can be paused at any arbitrary point in the code. In the case of a virtual machine, the CPU time spent in other virtual machines is known as steal time. If the machine is under heavy load—i.e., if there is a long queue of threads waiting to run—it may take some time before the paused thread gets to run again.

  • If the application performs synchronous disk access, a thread may be paused waiting for a slow disk I/O operation to complete [68]. In many languages, disk access can happen surprisingly, even if the code doesn’t explicitly mention file access—for example, the Java classloader lazily loads class files when they are first used, which could happen at any time in the program execution. I/O pauses and GC pauses may even conspire to combine their delays [69]. If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the I/O latency is further subject to the variability of network delays [29].

  • If the operating system is configured to allow swapping to disk (paging), a simple memory access may result in a page fault that requires a page from disk to be loaded into memory. The thread is paused while this slow I/O operation takes place. If memory pressure is high, this may in turn require a different page to be swapped out to disk. In extreme circumstances, the operating system may spend most of its time swapping pages in and out of memory and getting little actual work done (this is known as thrashing). To avoid this problem, paging is often disabled on server machines (if you would rather kill a process to free up memory than risk thrashing).

  • A Unix process can be paused by sending it the SIGSTOP signal, for example by pressing Ctrl-Z in a shell. This signal immediately stops the process from getting any more CPU cycles until it is resumed with SIGCONT, at which point it continues running where it left off. Even if your environment does not normally use SIGSTOP, it might be sent accidentally by an operations engineer.

All of these occurrences can preempt the running thread at any point and resume it at some later time, without the thread even noticing. The problem is similar to making multi-threaded code on a single machine thread-safe: you can’t assume anything about timing, because arbitrary context switches and parallelism may occur.
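
Since a paused thread only finds out after the fact by checking a clock, one pragmatic trick is a watchdog loop that sleeps for a short, fixed interval and flags wake-ups that arrive much later than scheduled. A sketch, assuming an illustrative 100 ms reporting threshold:

```java
public class PauseDetector {
    // How much longer the sleep took than intended; any large excess means
    // the process was preempted, swapped, GC-paused, or otherwise suspended.
    static long pauseMillis(long expectedSleepMillis, long elapsedMillis) {
        return Math.max(0L, elapsedMillis - expectedSleepMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        final long intervalMillis = 10;
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime(); // monotonic, so NTP steps don't interfere
            Thread.sleep(intervalMillis);
            long elapsed = (System.nanoTime() - start) / 1_000_000;
            long pause = pauseMillis(intervalMillis, elapsed);
            if (pause > 100) {
                System.out.println("process appears to have been paused ~" + pause + " ms");
            }
        }
    }
}
```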

When writing multi-threaded code on a single machine, we have fairly good tools for making it thread-safe: mutexes, semaphores, atomic counters, lock-free data structures, blocking queues, and so on. Unfortunately, these tools don’t directly translate to distributed systems, because a distributed system has no shared memory—only messages sent over an unreliable network.

A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function. During the pause, the rest of the world keeps moving and may even declare the paused node dead because it’s not responding. Eventually, the paused node may continue running, without even noticing that it was asleep until it checks its clock sometime later.

Response time guarantees

In many programming languages and operating systems, threads and processes may pause for an unbounded amount of time, as discussed. Those reasons for pausing can be eliminated if you try hard enough.

Some software runs in environments where a failure to respond within a specified time can cause serious damage: computers that control aircraft, rockets, robots, cars, and other physical objects must respond quickly and predictably to their sensor inputs. In these systems, there is a specified deadline by which the software must respond; if it doesn’t meet the deadline, that may cause a failure of the entire system. These are so-called hard real-time systems.

Is real-time really real?

In embedded systems, real-time means that a system is carefully designed and tested to meet specified timing guarantees in all circumstances. This meaning is in contrast to the more vague use of the term real-time on the web, where it describes servers pushing data to clients and stream processing without hard response time constraints (see Chapter 11).

For example, if your car’s onboard sensors detect that you are currently experiencing a crash, you wouldn’t want the release of the airbag to be delayed due to an inopportune GC pause in the airbag release system.

Providing real-time guarantees in a system requires support from all levels of the software stack: a real-time operating system (RTOS) that allows processes to be scheduled with a guaranteed allocation of CPU time in specified intervals is needed; library functions must document their worst-case execution times; dynamic memory allocation may be restricted or disallowed entirely (real-time garbage collectors exist, but the application must still ensure that it doesn’t give the GC too much work to do); and an enormous amount of testing and measurement must be done to ensure that guarantees are being met.

All of this requires a large amount of additional work and severely restricts the range of programming languages, libraries, and tools that can be used (since most languages and tools do not provide real-time guarantees). For these reasons, developing real-time systems is very expensive, and they are most commonly used in safety-critical embedded devices. Moreover, “real-time” is not the same as “high-performance”—in fact, real-time systems may have lower throughput, since they have to prioritize timely responses above all else (see also “Latency and Resource Utilization”).

对于大多数服务器端数据处理系统来说,实时保证根本不经济也不合适。因此,这些系统必须遭受因在非实时环境中运行而产生的暂停和时钟不稳定的问题。

For most server-side data processing systems, real-time guarantees are simply not economical or appropriate. Consequently, these systems must suffer the pauses and clock instability that come from operating in a non-real-time environment.

限制垃圾收集的影响

Limiting the impact of garbage collection

可以减轻进程暂停的负面影响,而无需诉诸昂贵的实时调度保证。语言运行时在安排垃圾收集时具有一定的灵活性,因为它们可以随着时间的推移跟踪对象分配的速率和剩余的可用内存。

The negative effects of process pauses can be mitigated without resorting to expensive real-time scheduling guarantees. Language runtimes have some flexibility around when they schedule garbage collections, because they can track the rate of object allocation and the remaining free memory over time.

一种新兴的想法是将 GC 暂停视为节点的短暂计划中断,并在一个节点收集其垃圾时让其他节点处理来自客户端的请求。如果运行时可以警告应用程序某个节点很快需要 GC 暂停,则应用程序可以停止向该节点发送新请求,等待其完成处理未完成的请求,然后在没有请求正在进行时执行 GC。这个技巧向客户端隐藏了 GC 暂停,并减少了响应时间的高百分位数 [ 70 , 71 ]。一些对延迟敏感的金融交易系统[ 72 ]使用这种方法。

An emerging idea is to treat GC pauses like brief planned outages of a node, and to let other nodes handle requests from clients while one node is collecting its garbage. If the runtime can warn the application that a node soon requires a GC pause, the application can stop sending new requests to that node, wait for it to finish processing outstanding requests, and then perform the GC while no requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles of response time [70, 71]. Some latency-sensitive financial trading systems [72] use this approach.
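这一技巧可以用如下示意代码来勾勒(仅为示意性草图:假设存在一个能在 GC 即将发生前发出警告的运行时回调,其中的类名和方法名均为假设,并非真实 API):

The trick can be sketched as follows (an illustrative sketch only: it assumes a hypothetical runtime callback that warns shortly before a GC is needed, and the class and method names are assumptions, not a real API):

```python
import gc

class LoadBalancer:
    """Keeps track of which nodes may receive new client requests."""
    def __init__(self):
        self.active = set()
    def add(self, node_name):
        self.active.add(node_name)
    def remove(self, node_name):
        self.active.discard(node_name)

class Node:
    """Treats a GC pause like a brief planned outage of the node."""
    def __init__(self, name, balancer):
        self.name = name
        self.balancer = balancer
        self.outstanding_requests = 0
        balancer.add(name)

    def on_gc_warning(self):
        # Hypothetical callback: the runtime warns that a GC pause is imminent.
        self.balancer.remove(self.name)        # 1. stop receiving new requests
        while self.outstanding_requests > 0:   # 2. drain in-flight requests
            self.outstanding_requests -= 1
        gc.collect()                           # 3. collect garbage while idle
        self.balancer.add(self.name)           # 4. rejoin the rotation

balancer = LoadBalancer()
node = Node("node-1", balancer)
node.outstanding_requests = 3
node.on_gc_warning()
print("node-1" in balancer.active)  # → True: clients never observed the pause
```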

这个想法的一个变体是仅对短期对象(收集速度很快)使用垃圾收集器,并在进程积累的长期对象多到需要一次完整 GC 之前,定期重新启动进程 [ 65 , 73 ]。一次可以重新启动一个节点,并且可以在计划重新启动之前将流量从该节点转移出去,就像滚动升级一样(请参阅第 4 章)。

A variant of this idea is to use the garbage collector only for short-lived objects (which are fast to collect) and to restart processes periodically, before they accumulate enough long-lived objects to require a full GC of long-lived objects [65, 73]. One node can be restarted at a time, and traffic can be shifted away from the node before the planned restart, like in a rolling upgrade (see Chapter 4).

这些措施不能完全防止垃圾收集暂停,但可以有效减少其对应用程序的影响。

These measures cannot fully prevent garbage collection pauses, but they can usefully reduce their impact on the application.

知识、真理与谎言

Knowledge, Truth, and Lies

到目前为止,在本章中,我们已经探讨了分布式系统与在单台计算机上运行的程序的不同之处:没有共享内存,只有通过具有可变延迟的不可靠网络传递消息,并且系统可能会遭受部分故障,时钟不可靠,处理暂停。

So far in this chapter we have explored the ways in which distributed systems are different from programs running on a single computer: there is no shared memory, only message passing via an unreliable network with variable delays, and the systems may suffer from partial failures, unreliable clocks, and processing pauses.

如果您不习惯分布式系统,这些问题的后果会让人非常困惑。网络中的节点无法确定任何事情——它只能根据通过网络接收(或未接收)的消息进行猜测。一个节点只能通过与另一个节点交换消息来了解另一个节点所处的状态(它存储了哪些数据,是否正常运行等)。如果远程节点没有响应,则无法知道它处于什么状态,因为无法可靠地区分网络中的问题与节点上的问题。

The consequences of these issues are profoundly disorienting if you’re not used to distributed systems. A node in the network cannot know anything for sure—it can only make guesses based on the messages it receives (or doesn’t receive) via the network. A node can only find out what state another node is in (what data it has stored, whether it is correctly functioning, etc.) by exchanging messages with it. If a remote node doesn’t respond, there is no way of knowing what state it is in, because problems in the network cannot reliably be distinguished from problems at a node.

对这些系统的讨论接近于哲学:我们知道我们的系统中什么是真或假?如果感知和测量机制不可靠,我们对这些知识的把握有多大?软件系统是否应该遵守我们期望的物理世界法则,例如因果关系?

Discussions of these systems border on the philosophical: What do we know to be true or false in our system? How sure can we be of that knowledge, if the mechanisms for perception and measurement are unreliable? Should software systems obey the laws that we expect of the physical world, such as cause and effect?

幸运的是,我们不需要去弄清楚生命的意义。在分布式系统中,我们可以陈述我们对行为(系统模型)所做的假设,并以满足这些假设的方式设计实际系统。可以证明算法在特定的系统模型中可以正确运行。这意味着即使底层系统模型提供的保证很少,也可以实现可靠的行为。

Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed system, we can state the assumptions we are making about the behavior (the system model) and design the actual system in such a way that it meets those assumptions. Algorithms can be proved to function correctly within a certain system model. This means that reliable behavior is achievable, even if the underlying system model provides very few guarantees.

然而,尽管可以使软件在不可靠的系统模型中表现良好,但要做到这一点并不容易。在本章的其余部分中,我们将进一步探讨分布式系统中知识和真理的概念,这将帮助我们思考我们可以做出的假设类型以及我们可能想要提供的保证。在第 9 章中,我们将继续研究一些分布式算法的示例,这些算法在特定假设下提供特定的保证。

However, although it is possible to make software well behaved in an unreliable system model, it is not straightforward to do so. In the rest of this chapter we will further explore the notions of knowledge and truth in distributed systems, which will help us think about the kinds of assumptions we can make and the guarantees we may want to provide. In Chapter 9 we will proceed to look at some examples of distributed algorithms that provide particular guarantees under particular assumptions.

真理是由多数人决定的

The Truth Is Defined by the Majority

想象一个存在非对称故障的网络:一个节点能够接收发送给它的所有消息,但从该节点发出的任何传出消息都会被丢弃或延迟[ 19 ]。即使该节点工作得很好,并且正在接收来自其他节点的请求,其他节点也听不到它的响应。经过一段时间的超时后,其他节点宣布它死亡,因为它们没有收到该节点的消息。情况像一场噩梦一样展开:半断开的节点被拖到墓地,一边踢一边尖叫“我没有死!”——但由于没有人能听到它的尖叫声,葬礼队伍以坚忍的决心继续前进。

Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but any outgoing messages from that node are dropped or delayed [19]. Even though that node is working perfectly well, and is receiving requests from other nodes, the other nodes cannot hear its responses. After some timeout, the other nodes declare it dead, because they haven’t heard from the node. The situation unfolds like a nightmare: the semi-disconnected node is dragged to the graveyard, kicking and screaming “I’m not dead!”—but since nobody can hear its screaming, the funeral procession continues with stoic determination.

在稍微不那么噩梦般的场景中,半断开节点可能会注意到它正在发送的消息没有被其他节点确认,因此意识到网络中一定存在故障。然而,该节点被其他节点错误地宣告死亡,并且半断开节点对此无能为力。

In a slightly less nightmarish scenario, the semi-disconnected node may notice that the messages it is sending are not being acknowledged by other nodes, and so realize that there must be a fault in the network. Nevertheless, the node is wrongly declared dead by the other nodes, and the semi-disconnected node cannot do anything about it.

作为第三种场景,想象一个节点经历了长时间的停止世界垃圾收集暂停。该节点的所有线程都被GC抢占并暂停一分钟,因此不会处理任何请求,也不会发送任何响应。其他节点等待、重试、变得不耐烦,最终宣布该节点死亡并将其装载到灵车上。最后,GC 完成,节点的线程继续运行,就像什么也没发生一样。其他节点感到惊讶的是,本应死亡的节点突然从棺材中抬起头,健康状况良好,并开始与旁观者愉快地聊天。起初,GCing 节点甚至没有意识到已经过去了一整分钟,并且它被宣布死亡——从它的角度来看,自从它上次与其他节点通信以来几乎没有过去任何时间。

As a third scenario, imagine a node that experiences a long stop-the-world garbage collection pause. All of the node’s threads are preempted by the GC and paused for one minute, and consequently, no requests are processed and no responses are sent. The other nodes wait, retry, grow impatient, and eventually declare the node dead and load it onto the hearse. Finally, the GC finishes and the node’s threads continue as if nothing had happened. The other nodes are surprised as the supposedly dead node suddenly raises its head out of the coffin, in full health, and starts cheerfully chatting with bystanders. At first, the GCing node doesn’t even realize that an entire minute has passed and that it was declared dead—from its perspective, hardly any time has passed since it was last talking to the other nodes.

这些故事的寓意是,节点不一定相信自己对情况的判断。分布式系统不能完全依赖于单个节点,因为节点可能随时发生故障,从而可能导致系统卡住且无法恢复。相反,许多分布式算法依赖于仲裁,即节点之间的投票(请参阅“读写的仲裁”):决策需要来自多个节点的最小投票数,以减少对任何一个特定节点的依赖。

The moral of these stories is that a node cannot necessarily trust its own judgment of a situation. A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms rely on a quorum, that is, voting among the nodes (see “Quorums for reading and writing”): decisions require some minimum number of votes from several nodes in order to reduce the dependence on any one particular node.

这包括关于宣布节点死亡的决定。如果法定数量的节点宣布另一个节点死亡,那么它必须被视为死亡,即使该节点自己仍然觉得自己活得好好的。单个节点必须遵守法定人数的决定并下台。

That includes decisions about declaring nodes dead. If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive. The individual node must abide by the quorum decision and step down.

最常见的是,法定人数是超过一半节点的绝对多数(尽管其他类型的法定人数也是可能的)。多数仲裁允许系统在个别节点发生故障时继续工作(三个节点可以容忍一次故障,五个节点可以容忍两次故障)。然而,它仍然是安全的,因为系统中只能有一个多数——不能有两个多数同时做出相互冲突的决定。当我们在第 9 章讨论共识算法时,我们将更详细地讨论仲裁的使用。

Most commonly, the quorum is an absolute majority of more than half the nodes (although other kinds of quorums are possible). A majority quorum allows the system to continue working if individual nodes have failed (with three nodes, one failure can be tolerated; with five nodes, two failures can be tolerated). However, it is still safe, because there can be only one majority in the system—there cannot be two majorities with conflicting decisions at the same time. We will discuss the use of quorums in more detail when we get to consensus algorithms in Chapter 9.
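多数仲裁的容错能力可以用几行示意性代码来表达:

The fault tolerance of a majority quorum can be expressed in a few illustrative lines:

```python
def majority_quorum(n: int) -> int:
    """Smallest number of votes that forms an absolute majority of n nodes."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Node failures the system survives while a majority can still vote."""
    return n - majority_quorum(n)

# Two majorities of the same n nodes must overlap in at least one node,
# so they cannot make conflicting decisions at the same time.
for n in (3, 5, 7):
    print(f"{n} nodes: quorum {majority_quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```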

领导者和锁

The leader and the lock

通常,系统只要求某件事只有其中之一。例如:

Frequently, a system requires there to be only one of some thing. For example:

  • 一个数据库分区只允许有一个节点作为领导者,以避免脑裂(参见 “处理节点中断”)。

  • Only one node is allowed to be the leader for a database partition, to avoid split brain (see “Handling Node Outages”).

  • 只允许一个事务或客户端持有特定资源或对象的锁,以防止并发写入和损坏它。

  • Only one transaction or client is allowed to hold the lock for a particular resource or object, to prevent concurrently writing to it and corrupting it.

  • 只允许一个用户注册特定的用户名,因为用户名必须唯一地标识一个用户。

  • Only one user is allowed to register a particular username, because a username must uniquely identify a user.

在分布式系统中实现这一点需要小心:即使一个节点认为它是“被选中的节点”(分区的领导者、锁的持有者、成功获取用户名的用户的请求处理程序),这也并不一定意味着法定人数的节点同意!一个节点可能以前是领导者,但如果其他节点在此期间宣布它死亡(例如,由于网络中断或 GC 暂停),它可能已被降级,并且可能已经选举出另一个领导者。

Implementing this in a distributed system requires care: even if a node believes that it is “the chosen one” (the leader of the partition, the holder of the lock, the request handler of the user who successfully grabbed the username), that doesn’t necessarily mean a quorum of nodes agrees! A node may have formerly been the leader, but if the other nodes declared it dead in the meantime (e.g., due to a network interruption or GC pause), it may have been demoted and another leader may have already been elected.

如果一个节点继续充当选定的节点,即使大多数节点已宣布其死亡,也可能会在设计不仔细的系统中引起问题。这样的节点可以以其自指定的身份向其他节点发送消息,如果其他节点相信它,则整个系统可能会做一些错误的事情。

If a node continues acting as the chosen one, even though the majority of nodes have declared it dead, it could cause problems in a system that is not carefully designed. Such a node could send messages to other nodes in its self-appointed capacity, and if other nodes believe it, the system as a whole may do something incorrect.

例如,图 8-4 显示了由于不正确的锁实现而导致的数据损坏错误。(这个错误并非纯理论:HBase 曾经有过这个问题 [ 74 , 75 ]。)假设你想确保存储服务中的一个文件一次只能被一个客户端访问,因为如果多个客户端尝试写入该文件,文件就会损坏。您尝试通过要求客户端在访问文件之前从锁服务获取租约来实现这一点。

For example, Figure 8-4 shows a data corruption bug due to an incorrect implementation of locking. (The bug is not theoretical: HBase used to have this problem [74, 75].) Say you want to ensure that a file in a storage service can only be accessed by one client at a time, because if multiple clients tried to write to it, the file would become corrupted. You try to implement this by requiring a client to obtain a lease from a lock service before accessing the file.

图 8-4。分布式锁的错误实现:客户端 1 认为它仍然持有有效的租约,即使租约已经过期,因此损坏了存储中的文件。

Figure 8-4. Incorrect implementation of a distributed lock: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage.

这个问题是我们在“进程暂停”中讨论的一个例子:如果持有租约的客户端暂停时间太长,它的租约就会过期。另一个客户端可以获得同一文件的租约,并开始写入该文件。当暂停的客户端返回时,它(错误地)相信它仍然具有有效的租约并继续写入该文件。结果,客户端的写入发生冲突并损坏文件。

The problem is an example of what we discussed in “Process Pauses”: if the client holding the lease is paused for too long, its lease expires. Another client can obtain a lease for the same file, and start writing to the file. When the paused client comes back, it believes (incorrectly) that it still has a valid lease and proceeds to also write to the file. As a result, the clients’ writes clash and corrupt the file.

击剑令牌

Fencing tokens

当使用锁或租约来保护对某些资源(例如图 8-4 中的文件存储)的访问时,我们需要确保错误地认为自己是“被选中的节点”的节点不能破坏系统的其余部分。实现这一目标的一种相当简单的技术称为隔离(fencing),如图 8-5 所示。

When using a lock or lease to protect access to some resource, such as the file storage in Figure 8-4, we need to ensure that a node that is under a false belief of being “the chosen one” cannot disrupt the rest of the system. A fairly simple technique that achieves this goal is called fencing, and is illustrated in Figure 8-5.

图 8-5。仅允许按照隔离令牌递增的顺序进行写入,从而确保对存储的访问安全。

Figure 8-5. Making access to storage safe by allowing writes only in the order of increasing fencing tokens.

假设锁服务器每次授予锁或租约时,还会返回一个隔离令牌(fencing token),它是一个每次授予锁时都会递增的数字(例如,由锁服务递增)。然后,我们可以要求客户端每次向存储服务发送写入请求时,都必须包含其当前的隔离令牌。

Let’s assume that every time the lock server grants a lock or lease, it also returns a fencing token, which is a number that increases every time a lock is granted (e.g., incremented by the lock service). We can then require that every time a client sends a write request to the storage service, it must include its current fencing token.

图 8-5中,客户端 1 使用令牌 33 获取租约,但随后它进入长时间暂停状态,租约到期。客户端 2 使用令牌 34 获取租约(数字总是增加),然后将其写入请求发送到存储服务,其中包含令牌 34。稍后,客户端 1 恢复正常并将其写入发送到存储服务,包括其令牌值 33。但是,存储服务器记得它已经处理了具有更高令牌号 (34) 的写入,因此它拒绝具有令牌 33 的请求。

In Figure 8-5, client 1 acquires the lease with a token of 33, but then it goes into a long pause and the lease expires. Client 2 acquires the lease with a token of 34 (the number always increases) and then sends its write request to the storage service, including the token of 34. Later, client 1 comes back to life and sends its write to the storage service, including its token value 33. However, the storage server remembers that it has already processed a write with a higher token number (34), and so it rejects the request with token 33.
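图 8-5 中存储服务的检查逻辑可以这样勾勒(示意性草图,并非任何真实存储服务的 API):

The check performed by the storage service in Figure 8-5 can be sketched like this (an illustrative sketch, not the API of any real storage service):

```python
class StorageService:
    """Accepts a write only if its fencing token is not older than
    the highest token already processed."""
    def __init__(self):
        self.max_token_seen = 0
        self.files = {}

    def write(self, token: int, filename: str, data: str) -> bool:
        if token < self.max_token_seen:
            return False                 # stale token: the lease must have expired
        self.max_token_seen = token
        self.files[filename] = data
        return True

storage = StorageService()
print(storage.write(34, "file", "from client 2"))  # → True: accepted
print(storage.write(33, "file", "from client 1"))  # → False: rejected as stale
print(storage.files["file"])                       # → from client 2
```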

如果使用 ZooKeeper 作为锁服务,则可以使用事务 ID zxid 或节点版本 cversion 作为 fencing token。由于它们保证单调递增,因此它们具有所需的属性 [ 74 ]。

If ZooKeeper is used as lock service, the transaction ID zxid or the node version cversion can be used as fencing token. Since they are guaranteed to be monotonically increasing, they have the required properties [74].

请注意,此机制要求资源本身在检查令牌方面发挥积极作用,方法是拒绝使用比已处理令牌更旧的令牌进行的任何写入,仅依靠客户端本身检查其锁定状态是不够的。对于未明确支持防护令牌的资源,您仍然可以解决该限制(例如,对于文件存储服务,您可以在文件名中包含防护令牌)。然而,某种检查是必要的,以避免在锁的保护之外处理请求。

Note that this mechanism requires the resource itself to take an active role in checking tokens by rejecting any writes with an older token than one that has already been processed—it is not sufficient to rely on clients checking their lock status themselves. For resources that do not explicitly support fencing tokens, you might still be able to work around the limitation (for example, in the case of a file storage service you could include the fencing token in the filename). However, some kind of check is necessary to avoid processing requests outside of the lock’s protection.

在服务器端检查令牌可能看起来是一个缺点,但这可以说是一件好事:对于服务来说,假设其客户端总是表现良好是不明智的,因为运行客户端的人的优先事项往往与运行服务的人的优先事项截然不同 [ 76 ]。因此,对于任何服务来说,保护自己免受意外滥用的客户端的侵害都是一个好主意。

Checking a token on the server side may seem like a downside, but it is arguably a good thing: it is unwise for a service to assume that its clients will always be well behaved, because the clients are often run by people whose priorities are very different from the priorities of the people running the service [76]. Thus, it is a good idea for any service to protect itself from accidentally abusive clients.

拜占庭错误

Byzantine Faults

防护令牌可以检测并阻止无意中出错的节点(例如,因为它尚未发现其租约已过期)。然而,如果节点故意想要破坏系统的保证,它只需发送带有伪造的防护令牌的消息就能轻易做到。

Fencing tokens can detect and block a node that is inadvertently acting in error (e.g., because it hasn’t yet found out that its lease has expired). However, if the node deliberately wanted to subvert the system’s guarantees, it could easily do so by sending messages with a fake fencing token.

在本书中,我们假设节点不可靠但诚实:它们可能很慢或永远不会响应(由于故障),并且它们的状态可能已过时(由于 GC 暂停或网络延迟),但我们假设,如果一个节点确实做出了响应,它说的就是“真相”:据其所知,它正在遵守协议规则。

In this book we assume that nodes are unreliable but honest: they may be slow or never respond (due to a fault), and their state may be outdated (due to a GC pause or network delays), but we assume that if a node does respond, it is telling the “truth”: to the best of its knowledge, it is playing by the rules of the protocol.

如果存在节点可能“撒谎”(发送任意错误或损坏的响应)的风险,那么分布式系统问题就会变得更加困难,例如,如果节点可能声称已收到特定消息,但实际上并未收到。这种行为被称为拜占庭错误,而在这种不信任的环境中达成共识的问题被称为拜占庭将军问题 [ 77 ]。

Distributed systems problems become much harder if there is a risk that nodes may “lie” (send arbitrary faulty or corrupted responses)—for example, if a node may claim to have received a particular message when in fact it didn’t. Such behavior is known as a Byzantine fault, and the problem of reaching consensus in this untrusting environment is known as the Byzantine Generals Problem [77].

如果系统在某些节点发生故障且不遵守协议,或者恶意攻击者干扰网络时仍能继续正确运行,则该系统具有拜占庭容错能力。这种担忧在某些特定情况下是相关的。例如:

A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network. This concern is relevant in certain specific circumstances. For example:

  • 在航空航天环境中,计算机内存或 CPU 寄存器中的数据可能会因辐射而损坏,导致其以任意不可预测的方式响应其他节点。由于系统故障的代价非常高昂(例如,飞机坠毁并杀死机上所有人,或者火箭与国际空间站相撞),因此飞行控制系统必须容忍拜占庭故障[81 , 82 ]

  • In aerospace environments, the data in a computer’s memory or CPU register could become corrupted by radiation, leading it to respond to other nodes in arbitrarily unpredictable ways. Since a system failure would be very expensive (e.g., an aircraft crashing and killing everyone on board, or a rocket colliding with the International Space Station), flight control systems must tolerate Byzantine faults [81, 82].

  • 在具有多个参与组织的系统中,一些参与者可能会试图欺骗或欺骗其他参与者。在这种情况下,节点简单地信任另一个节点的消息是不安全的,因为它们可能是恶意发送的。例如,像比特币和其他区块链这样的点对点网络可以被认为是一种让相互不信任的各方就交易是否发生达成一致的方式,而不依赖于中央机构[83 ]

  • In a system with multiple participating organizations, some participants may attempt to cheat or defraud others. In such circumstances, it is not safe for a node to simply trust another node’s messages, since they may be sent with malicious intent. For example, peer-to-peer networks like Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties to agree whether a transaction happened or not, without relying on a central authority [83].

然而,在我们在本书中讨论的系统类型中,我们通常可以安全地假设不存在拜占庭错误。在您的数据中心中,所有节点都由您的组织控制(因此它们有望被信任),并且辐射水平足够低,内存损坏不是主要问题。使系统具有拜占庭容错能力的协议相当复杂[ 84 ],而容错嵌入式系统依赖于硬件层面的支持[ 81 ]。在大多数服务器端数据系统中,部署拜占庭容错解决方案的成本使其不切实际。

However, in the kinds of systems we discuss in this book, we can usually safely assume that there are no Byzantine faults. In your datacenter, all the nodes are controlled by your organization (so they can hopefully be trusted) and radiation levels are low enough that memory corruption is not a major problem. Protocols for making systems Byzantine fault-tolerant are quite complicated [84], and fault-tolerant embedded systems rely on support from the hardware level [81]. In most server-side data systems, the cost of deploying Byzantine fault-tolerant solutions makes them impractical.

Web 应用程序确实需要预料到最终用户控制下的客户端(例如 Web 浏览器)的任意和恶意行为。这就是为什么输入验证、清理和输出转义如此重要:例如,防止 SQL 注入和跨站脚本攻击。然而,我们在这里通常不使用拜占庭容错协议,而只是让服务器作为权威来决定哪些客户端行为是允许的、哪些是不允许的。在没有这种中央权威的点对点网络中,拜占庭容错则更为重要。

Web applications do need to expect arbitrary and malicious behavior of clients that are under end-user control, such as web browsers. This is why input validation, sanitization, and output escaping are so important: to prevent SQL injection and cross-site scripting, for example. However, we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where there is no such central authority, Byzantine fault tolerance is more relevant.

软件中的错误可以被视为拜占庭故障,但如果你将相同的软件部署到所有节点,那么拜占庭容错算法也无法拯救你。大多数拜占庭容错算法需要超过三分之二的节点正常运行(即,如果有四个节点,最多只能有一个发生故障)。要使用这种方法来对付错误,您必须拥有同一软件的四个独立实现,并希望错误只出现在四个实现之一中。

A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant algorithms require a supermajority of more than two-thirds of the nodes to be functioning correctly (i.e., if you have four nodes, at most one may malfunction). To use this approach against bugs, you would have to have four independent implementations of the same software and hope that a bug only appears in one of the four implementations.
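正文中“超过三分之二”的法定人数要求通常写成 n ≥ 3f + 1,其中 f 为可容忍的拜占庭故障节点数,可以用几行代码说明:

The "more than two-thirds" supermajority requirement above is usually written as n ≥ 3f + 1, where f is the number of Byzantine faults tolerated, which a few lines can illustrate:

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1 still holds."""
    return (n - 1) // 3

print(max_byzantine_faults(4))  # → 1: with four nodes, at most one may malfunction
print(max_byzantine_faults(3))  # → 0: three nodes tolerate no Byzantine fault
print(max_byzantine_faults(7))  # → 2
```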

同样,如果协议能够保护我们免受漏洞、安全损害和恶意攻击,那将很有吸引力。不幸的是,这也不现实:在大多数系统中,如果攻击者可以危害一个节点,他们很可能会危害所有节点,因为它们可能运行相同的软件。因此,传统机制(身份验证、访问控制、加密、防火墙等)仍然是抵御攻击者的主要保护措施。

Similarly, it would be appealing if a protocol could protect us from vulnerabilities, security compromises, and malicious attacks. Unfortunately, this is not realistic either: in most systems, if an attacker can compromise one node, they can probably compromise all of them, because they are probably running the same software. Thus, traditional mechanisms (authentication, access control, encryption, firewalls, and so on) continue to be the main protection against attackers.

弱的说谎形式

Weak forms of lying

尽管我们假设节点通常是诚实的,但值得在软件中添加机制来防止弱形式的“说谎”——例如,由于硬件问题、软件错误和配置错误而导致的无效消息。这种保护机制并不是成熟的拜占庭式容错能力,因为它们无法抵御坚定的对手,但它们仍然是实现更高可靠性的简单而务实的步骤。例如:

Although we assume that nodes are generally honest, it can be worth adding mechanisms to software that guard against weak forms of “lying”—for example, invalid messages due to hardware issues, software bugs, and misconfiguration. Such protection mechanisms are not full-blown Byzantine fault tolerance, as they would not withstand a determined adversary, but they are nevertheless simple and pragmatic steps toward better reliability. For example:

  • 网络数据包有时会由于硬件问题或操作系统、驱动程序、路由器等中的错误而被损坏。通常,损坏的数据包会被 TCP 和 UDP 内置的校验和捕获,但有时它们会逃避检测 [ 85 , 86 , 87 ]。简单的措施通常足以防止此类损坏,例如应用程序级协议中的校验和。

  • Network packets do sometimes get corrupted due to hardware issues or bugs in operating systems, drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and UDP, but sometimes they evade detection [85, 86, 87]. Simple measures are usually sufficient protection against such corruption, such as checksums in the application-level protocol.

  • 可公开访问的应用程序必须仔细清理用户的任何输入,例如检查值是否在合理范围内并限制字符串的大小以防止通过大量内存分配造成拒绝服务。防火墙后面的内部服务可能能够摆脱对输入不太严格的检查,但对值进行一些基本的健全性检查(例如,在协议解析中[ 85 ])是一个好主意。

  • A publicly accessible application must carefully sanitize any inputs from users, for example checking that a value is within a reasonable range and limiting the size of strings to prevent denial of service through large memory allocations. An internal service behind a firewall may be able to get away with less strict checks on inputs, but some basic sanity-checking of values (e.g., in protocol parsing [85]) is a good idea.

  • NTP 客户端可以配置多个服务器地址。同步时,客户端会联系所有服务器,估计它们的错误,并检查大多数服务器是否在某个时间范围上达成一致。只要大多数服务器都正常,报告错误时间的配置错误的 NTP 服务器就会被检测为异常值并被排除在同步之外 [ 37 ]。使用多个服务器使 NTP 比仅使用单个服务器更加健壮。

  • NTP clients can be configured with multiple server addresses. When synchronizing, the client contacts all of them, estimates their errors, and checks that a majority of servers agree on some time range. As long as most of the servers are okay, a misconfigured NTP server that is reporting an incorrect time is detected as an outlier and is excluded from synchronization [37]. The use of multiple servers makes NTP more robust than if it only uses a single server.
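上面第一点提到的应用层校验和可以这样勾勒(示意性草图,使用标准库的 CRC-32):

The application-level checksum mentioned in the first point above can be sketched like this (an illustrative sketch using the standard library's CRC-32):

```python
import struct
import zlib

def encode(payload: bytes) -> bytes:
    """Prepend a CRC-32 checksum so end-to-end corruption can be detected,
    even if it slipped past the TCP/UDP checksums."""
    return struct.pack(">I", zlib.crc32(payload)) + payload

def decode(message: bytes) -> bytes:
    """Verify and strip the checksum; raise if the payload was corrupted."""
    (expected,) = struct.unpack(">I", message[:4])
    payload = message[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: message corrupted in transit")
    return payload

message = encode(b"hello")
assert decode(message) == b"hello"

corrupted = message[:-1] + bytes([message[-1] ^ 0x01])  # flip one bit
try:
    decode(corrupted)
except ValueError as e:
    print(e)  # → checksum mismatch: message corrupted in transit
```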

系统模型与现实

System Model and Reality

许多算法被设计用来解决分布式系统问题——例如,我们将在第 9 章中研究共识问题的解决方案。为了发挥作用,这些算法需要容忍我们在本章中讨论的分布式系统的各种故障。

Many algorithms have been designed to solve distributed systems problems—for example, we will examine solutions for the consensus problem in Chapter 9. In order to be useful, these algorithms need to tolerate the various faults of distributed systems that we discussed in this chapter.

算法的编写方式不能过于依赖运行算法的硬件和软件配置的细节。这反过来要求我们以某种方式形式化我们期望在系统中发生的故障类型。我们通过定义系统模型来做到这一点,系统模型是描述算法可能假设的事物的抽象。

Algorithms need to be written in a way that does not depend too heavily on the details of the hardware and software configuration on which they are run. This in turn requires that we somehow formalize the kinds of faults that we expect to happen in a system. We do this by defining a system model, which is an abstraction that describes what things an algorithm may assume.

关于时序假设,常用三种系统模型:

With regard to timing assumptions, three system models are in common use:

同步模型
Synchronous model

同步模型假设有界网络延迟、有界进程暂停和有界时钟误差。这并不意味着完全同步的时钟或零网络延迟;它只是意味着您知道网络延迟、暂停和时钟漂移永远不会超过某个固定的上限 [ 88 ]。同步模型并不是大多数实际系统的现实模型,因为(如本章所讨论的)无限延迟和暂停确实会发生。

The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock error. This does not imply exactly synchronized clocks or zero network delay; it just means you know that network delay, pauses, and clock drift will never exceed some fixed upper bound [88]. The synchronous model is not a realistic model of most practical systems, because (as discussed in this chapter) unbounded delays and pauses do occur.

部分同步模型
Partially synchronous model

部分同步意味着系统在大多数时间表现得像同步系统,但有时会超出网络延迟、进程暂停和时钟漂移的界限[ 88 ]。这是许多系统的现实模型:大多数时候,网络和流程都表现得很好,否则我们将永远无法完成任何事情,但我们必须考虑到这样一个事实,即任何时序假设都可能偶尔被打破。发生这种情况时,网络延迟、暂停和时钟误差可能会变得任意大。

Partial synchrony means that a system behaves like a synchronous system most of the time, but it sometimes exceeds the bounds for network delay, process pauses, and clock drift [88]. This is a realistic model of many systems: most of the time, networks and processes are quite well behaved—otherwise we would never be able to get anything done—but we have to reckon with the fact that any timing assumptions may be shattered occasionally. When this happens, network delay, pauses, and clock error may become arbitrarily large.

异步模型
Asynchronous model

在这个模型中,算法不允许做出任何时序假设——事实上,它甚至没有时钟(因此它不能使用超时)。有些算法可以为异步模型设计,但限制很大。

In this model, an algorithm is not allowed to make any timing assumptions—in fact, it does not even have a clock (so it cannot use timeouts). Some algorithms can be designed for the asynchronous model, but it is very restrictive.

此外,除了时序问题之外,我们还必须考虑节点故障。三种最常见的节点系统模型是:

Moreover, besides timing issues, we have to consider node failures. The three most common system models for nodes are:

急停故障
Crash-stop faults

在紧急停止模型中,算法可以假设节点只能以一种方式发生故障,即崩溃。这意味着该节点可能随时突然停止响应,此后该节点就永远消失了——它再也不会回来。

In the crash-stop model, an algorithm may assume that a node can fail in only one way, namely by crashing. This means that the node may suddenly stop responding at any moment, and thereafter that node is gone forever—it never comes back.

崩溃恢复故障
Crash-recovery faults

我们假设节点可能随时崩溃,并且可能在某个未知时间后再次开始响应。在崩溃恢复模型中,假设节点具有在崩溃期间保留的稳定存储(即非易失性磁盘存储),而假设内存中的状态丢失。

We assume that nodes may crash at any moment, and perhaps start responding again after some unknown time. In the crash-recovery model, nodes are assumed to have stable storage (i.e., nonvolatile disk storage) that is preserved across crashes, while the in-memory state is assumed to be lost.

拜占庭式(任意)错误
Byzantine (arbitrary) faults

节点绝对可以做任何事情,包括尝试欺骗和欺骗其他节点,如上一节所述。

Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described in the last section.

对于对真实系统进行建模,具有崩溃恢复故障的部分同步模型通常是最有用的模型。但分布式算法如何应对该模型呢?

For modeling real systems, the partially synchronous model with crash-recovery faults is generally the most useful model. But how do distributed algorithms cope with that model?

算法的正确性

Correctness of an algorithm

为了定义算法的正确性 意味着什么,我们可以描述它的属性。例如,排序算法的输出具有这样的属性:对于输出列表的任意两个不同元素,左边的元素小于右边的元素。这只是定义列表排序含义的一种正式方式。

To define what it means for an algorithm to be correct, we can describe its properties. For example, the output of a sorting algorithm has the property that for any two distinct elements of the output list, the element further to the left is smaller than the element further to the right. That is simply a formal way of defining what it means for a list to be sorted.

同样,我们可以写下我们想要的分布式算法的属性来定义正确的含义。例如,如果我们为锁生成防护令牌(请参阅 “防护令牌”),我们可能要求算法具有以下属性:

Similarly, we can write down the properties we want of a distributed algorithm to define what it means to be correct. For example, if we are generating fencing tokens for a lock (see “Fencing tokens”), we may require the algorithm to have the following properties:

独特性
Uniqueness

对防护令牌的两个请求不会返回相同的值。

No two requests for a fencing token return the same value.

单调序列
Monotonic sequence

如果请求 x 返回令牌 t_x,并且请求 y 返回令牌 t_y,并且 x 在 y 开始之前完成,则 t_x < t_y。

If request x returned token t_x, and request y returned token t_y, and x completed before y began, then t_x < t_y.

可用性
Availability

请求隔离令牌且未崩溃的节点最终会收到响应。

A node that requests a fencing token and does not crash eventually receives a response.

如果一个算法在我们假设的系统模型中可能发生的所有情况下始终满足其属性,则该算法在某个系统模型中是正确的。但这有什么意义呢?如果所有节点崩溃,或者所有网络延迟突然变得无限长,那么任何算法都将无法完成任何事情。

An algorithm is correct in some system model if it always satisfies its properties in all situations that we assume may occur in that system model. But how does this make sense? If all nodes crash, or all network delays suddenly become infinitely long, then no algorithm will be able to get anything done.
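作为示意,在单节点、无故障的简化模型中,一个普通的计数器就能同时满足唯一性和单调序列这两个属性(真正容错的实现则需要第 9 章讨论的共识算法):

As an illustration, in the simplified model of a single node that never crashes, a plain counter already satisfies both the uniqueness and the monotonic sequence properties (a genuinely fault-tolerant implementation requires the consensus algorithms of Chapter 9):

```python
import itertools

class FencingTokenService:
    """Single-node sketch of a fencing token generator."""
    def __init__(self):
        self._counter = itertools.count(1)

    def request_token(self) -> int:
        return next(self._counter)

svc = FencingTokenService()
tokens = [svc.request_token() for _ in range(5)]
assert len(set(tokens)) == len(tokens)                 # uniqueness
assert all(a < b for a, b in zip(tokens, tokens[1:]))  # monotonic sequence
print(tokens)  # → [1, 2, 3, 4, 5]
```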

安全性和活力

Safety and liveness

为了澄清这种情况,有必要区分两种不同类型的属性: 安全性活性属性。在刚刚给出的示例中,唯一性单调序列是安全属性,但可用性是活跃属性。

To clarify the situation, it is worth distinguishing between two different kinds of properties: safety and liveness properties. In the example just given, uniqueness and monotonic sequence are safety properties, but availability is a liveness property.

这两种属性有什么区别?值得注意的是,活性属性的定义中通常包含“最终”一词。(是的,你猜对了——最终一致性是一种活性属性 [ 89 ]。)

What distinguishes the two kinds of properties? A giveaway is that liveness properties often include the word “eventually” in their definition. (And yes, you guessed it—eventual consistency is a liveness property [89].)

安全性通常被非正式地定义为没有发生任何不好的事情,而活跃性则被定义为最终会发生好事。然而,最好不要过多地解读这些非正式的定义,因为好与坏的含义是主观的。安全性和活跃性的实际定义是精确且数学化的[ 90 ]:

Safety is often informally defined as nothing bad happens, and liveness as something good eventually happens. However, it’s best to not read too much into those informal definitions, because the meaning of good and bad is subjective. The actual definitions of safety and liveness are precise and mathematical [90]:

  • 如果违反了安全属性,我们可以指出它被破坏的特定时间点(例如,如果违反了唯一性属性,我们可以识别返回重复防护令牌的特定操作)。安全属性被侵犯后,违规行为就无法挽回——损害已经造成。

  • If a safety property is violated, we can point at a particular point in time at which it was broken (for example, if the uniqueness property was violated, we can identify the particular operation in which a duplicate fencing token was returned). After a safety property has been violated, the violation cannot be undone—the damage is already done.

  • 活跃性属性的工作方式相反:它可能在某个时间点不成立(例如,节点可能已发送请求但尚未收到响应),但总希望它在将来可以得到满足(即通过接收响应)。

  • A liveness property works the other way round: it may not hold at some point in time (for example, a node may have sent a request but not yet received a response), but there is always hope that it may be satisfied in the future (namely by receiving a response).

区分安全性和活性属性的一个优点是它可以帮助我们处理困难的系统模型。对于分布式算法,通常要求安全属性 在系统模型的所有可能情况下始终保持不变[ 88 ]。也就是说,即使所有节点崩溃,或者整个网络失败,算法仍然必须确保它不会返回错误的结果(即,安全属性仍然满足)。

An advantage of distinguishing between safety and liveness properties is that it helps us deal with difficult system models. For distributed algorithms, it is common to require that safety properties always hold, in all possible situations of a system model [88]. That is, even if all nodes crash, or the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong result (i.e., that the safety properties remain satisfied).

然而,对于活跃性属性,我们可以附加一些前提条件:例如,我们可以说,仅当大多数节点没有崩溃、并且网络最终从中断中恢复时,请求才需要收到响应。部分同步模型的定义要求系统最终恢复到同步状态,即任何一段网络中断都只持续有限的时间,然后被修复。

However, with liveness properties we are allowed to make caveats: for example, we could say that a request needs to receive a response only if a majority of nodes have not crashed, and only if the network eventually recovers from an outage. The definition of the partially synchronous model requires that eventually the system returns to a synchronous state—that is, any period of network interruption lasts only for a finite duration and is then repaired.

将系统模型映射到现实世界

Mapping system models to the real world

安全性和活跃性属性以及系统模型对于推理分布式算法的正确性非常有用。然而,当在实践中实现算法时,现实中混乱的事实又会再次出现,并且很明显系统模型是现实的简化抽象。

Safety and liveness properties and system models are very useful for reasoning about the correctness of a distributed algorithm. However, when implementing an algorithm in practice, the messy facts of reality come back to bite you again, and it becomes clear that the system model is a simplified abstraction of reality.

例如,崩溃恢复模型中的算法通常假设稳定存储中的数据能够在崩溃后幸存下来。但是,如果磁盘上的数据损坏,或者由于硬件错误或配置错误而导致数据被删除,会发生什么情况[ 91 ]?如果服务器存在固件错误并且在重新启动时无法识别其硬盘驱动器,即使驱动器已正确连接到服务器 [ 92 ],会发生什么情况?

For example, algorithms in the crash-recovery model generally assume that data in stable storage survives crashes. However, what happens if the data on disk is corrupted, or the data is wiped out due to hardware error or misconfiguration [91]? What happens if a server has a firmware bug and fails to recognize its hard drives on reboot, even though the drives are correctly attached to the server [92]?

仲裁算法(请参阅“读写仲裁”)依赖于节点记住其声称已存储的数据。如果一个节点可能会失忆并忘记以前存储的数据,就会破坏法定人数条件,从而破坏算法的正确性。也许需要一种新的系统模型,在该模型中,我们假设稳定的存储大多数情况下不会崩溃,但有时可能会丢失。但这个模型变得更难以推理。

Quorum algorithms (see “Quorums for reading and writing”) rely on a node remembering the data that it claims to have stored. If a node may suffer from amnesia and forget previously stored data, that breaks the quorum condition, and thus breaks the correctness of the algorithm. Perhaps a new system model is needed, in which we assume that stable storage mostly survives crashes, but may sometimes be lost. But that model then becomes harder to reason about.
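下面是一个假设性的小示例,用来说明上述“失忆”问题:在 w + r > n 的仲裁配置中,一旦某个节点丢失了它已确认写入的数据,读仲裁就可能完全错过该值(节点数量与函数名均为示意性假设)。

The following hypothetical sketch illustrates the "amnesia" problem above: in a quorum configuration with w + r > n, if one node loses data it has acknowledged, a read quorum can miss the value entirely (the node counts and function names are illustrative assumptions).

```python
# Assumed setup: n = 3 replicas, w = 2, r = 2, so w + r > n holds.
n, w, r = 3, 2, 2
replicas = [{} for _ in range(n)]  # each replica is a simple key-value dict

def write(key, value):
    # Acknowledge the write after w replicas have stored it.
    for node in replicas[:w]:
        node[key] = value

def read(key, from_nodes):
    # Collect whatever values the chosen replicas still hold.
    values = [node.get(key) for node in from_nodes]
    return [v for v in values if v is not None]

write("x", 42)
# Overlap guarantee: any r replicas include at least one of the w writers.
assert 42 in read("x", replicas[1:1 + r])

# "Amnesia": replica 1 loses its stable storage after acknowledging.
replicas[1].clear()
# Now a quorum read from replicas 1 and 2 misses the value entirely,
# even though w + r > n still holds on paper.
assert read("x", replicas[1:1 + r]) == []
```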

算法的理论描述可以声明某些事情只是假设不会发生,而在非拜占庭系统中,我们确实必须对可能发生和不可能发生的故障做出一些假设。然而,真正的实现可能仍然需要包含代码来处理那些被认为不可能发生、却真的发生了的情况,即使这种处理归结为 printf("Sucks to be you") 和 exit(666),也就是让人类操作员来清理混乱[ 93 ]。(这可以说是计算机科学和软件工程之间的区别。)

The theoretical description of an algorithm can declare that certain things are simply assumed not to happen—and in non-Byzantine systems, we do have to make some assumptions about faults that can and cannot happen. However, a real implementation may still have to include code to handle the case where something happens that was assumed to be impossible, even if that handling boils down to printf("Sucks to be you") and exit(666)—i.e., letting a human operator clean up the mess [93]. (This is arguably the difference between computer science and software engineering.)
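这种防御性处理可以大致示意如下(纯属示例,函数名与退出码均为假设):

This defensive pattern can be sketched roughly as follows (purely illustrative; the function name and exit code are assumptions):

```python
import sys

def apply_fencing_token(current_token, new_token):
    """Accept a new fencing token, which the model assumes is strictly
    increasing. A real implementation still guards against the
    'impossible' case and hands the mess to a human operator."""
    if new_token > current_token:
        return new_token
    # Assumed impossible under the system model, yet handled anyway:
    sys.stderr.write("Sucks to be you\n")
    sys.exit(666)  # let a human operator clean up the mess

assert apply_fencing_token(1, 2) == 2
```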

这并不是说理论的、抽象的系统模型毫无价值——恰恰相反。它们对于将真实系统的复杂性提炼为我们可以推理的一组可管理的故障非常有帮助,以便我们能够理解问题并尝试系统地解决它。我们可以通过证明算法的属性在某些系统模型中始终成立来证明算法的正确性。

That is not to say that theoretical, abstract system models are worthless—quite the opposite. They are incredibly helpful for distilling down the complexity of real systems to a manageable set of faults that we can reason about, so that we can understand the problem and try to solve it systematically. We can prove algorithms correct by showing that their properties always hold in some system model.

证明算法正确并不意味着它在真实系统上的实现就一定总是正确运行。但这是非常好的第一步,因为理论分析可以揭示算法中可能在真实系统中长期隐藏的问题;只有当您的假设(例如关于时序的假设)在不寻常的情况下被打破时,这些问题才会给您带来麻烦。理论分析和实证检验同样重要。

Proving an algorithm correct does not mean its implementation on a real system will necessarily always behave correctly. But it’s a very good first step, because the theoretical analysis can uncover problems in an algorithm that might remain hidden for a long time in a real system, and that only come to bite you when your assumptions (e.g., about timing) are defeated due to unusual circumstances. Theoretical analysis and empirical testing are equally important.

概括

Summary

在本章中,我们讨论了分布式系统中可能出现的各种问题,包括:

In this chapter we have discussed a wide range of problems that can occur in distributed systems, including:

  • 每当您尝试通过网络发送数据包时,它都可能会丢失或任意延迟。同样,回复也可能会丢失或延迟,因此如果您没有收到回复,您就不知道消息是否已发送。

  • Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed. Likewise, the reply may be lost or delayed, so if you don’t get a reply, you have no idea whether the message got through.

  • 节点的时钟可能与其他节点严重不同步(尽管您已尽最大努力设置 NTP),它可能会突然向前或向后跳跃,而且依赖它是危险的,因为您很可能无法准确衡量时钟的误差区间。

  • A node’s clock may be significantly out of sync with other nodes (despite your best efforts to set up NTP), it may suddenly jump forward or back in time, and relying on it is dangerous because you most likely don’t have a good measure of your clock’s error interval.

  • 进程可能在其执行过程中的任何时刻暂停相当长的时间(可能是由于“全局停顿”(stop-the-world)的垃圾收集器),被其他节点宣告死亡,然后再次复活,却没有意识到自己曾被暂停。

  • A process may pause for a substantial amount of time at any point in its execution (perhaps due to a stop-the-world garbage collector), be declared dead by other nodes, and then come back to life again without realizing that it was paused.
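关于时钟的那一点可以用一个小示例说明(示意性质):Python 的 time.time() 跟随可能被 NTP 前后调整的挂钟,而 time.monotonic() 只会单调前进,适合在单个节点上测量经过的时间。

The point about clocks can be illustrated with a small sketch: Python's time.time() follows the wall clock (which NTP may step forward or backward), whereas time.monotonic() only ever moves forward and is suitable for measuring elapsed time on a single node.

```python
import time

# Measure an elapsed duration with the monotonic clock: it cannot jump
# backward, so the result is always non-negative.
t0 = time.monotonic()
time.sleep(0.01)
elapsed = time.monotonic() - t0
assert elapsed > 0

# By contrast, the difference between two time.time() readings can come
# out negative if the wall clock is stepped backward in between, which
# is why durations should never be computed from the time-of-day clock.
```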

可能发生此类部分故障的事实是分布式系统的定义性特征。每当软件尝试执行涉及其他节点的任何操作时,它都有可能偶尔失败,或者随机变慢,或者根本不响应(并最终超时)。在分布式系统中,我们尝试在软件中构建对部分故障的容忍能力,以便即使系统的某些组成部分出现故障,整个系统也能继续运行。

The fact that such partial failures can occur is the defining characteristic of distributed systems. Whenever software tries to do anything involving other nodes, there is the possibility that it may occasionally fail, or randomly go slow, or not respond at all (and eventually time out). In distributed systems, we try to build tolerance of partial failures into software, so that the system as a whole may continue functioning even when some of its constituent parts are broken.

要容忍错误,第一步是检测错误,但即便如此也很困难。大多数系统没有准确的机制来检测节点是否发生故障,因此大多数分布式算法依赖超时来确定远程节点是否仍然可用。然而,超时无法区分网络故障和节点故障,并且可变的网络延迟有时会导致节点被错误地怀疑崩溃。此外,有时节点可能处于降级状态:例如,由于驱动程序错误,千兆位网络接口的吞吐量可能突然下降到 1 Kb/s [ 94 ]。这种“跛行”但未死亡的节点可能比完全失败的节点更难处理。

To tolerate faults, the first step is to detect them, but even that is hard. Most systems don’t have an accurate mechanism of detecting whether a node has failed, so most distributed algorithms rely on timeouts to determine whether a remote node is still available. However, timeouts can’t distinguish between network and node failures, and variable network delay sometimes causes a node to be falsely suspected of crashing. Moreover, sometimes a node can be in a degraded state: for example, a Gigabit network interface could suddenly drop to 1 Kb/s throughput due to a driver bug [94]. Such a node that is “limping” but not dead can be even more difficult to deal with than a cleanly failed node.
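基于超时的故障怀疑机制可以粗略示意如下(类名、心跳接口和超时值均为示意性假设);注意它无法区分节点故障和网络故障:

Timeout-based failure suspicion can be roughly sketched as follows (the class name, heartbeat interface, and timeout value are illustrative assumptions); note that it cannot distinguish a node failure from a network failure:

```python
import time

class TimeoutFailureDetector:
    """Suspect a node of having crashed if we have not heard from it
    within a fixed timeout. This is the best most systems can do, and
    a slow network can cause a live node to be falsely suspected."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_heard = {}

    def heartbeat(self, node, now=None):
        # Record the time we last heard from the node.
        self.last_heard[node] = time.monotonic() if now is None else now

    def suspected(self, node, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_heard.get(node)
        # Never heard from it, or silent longer than the timeout: suspect.
        return last is None or (now - last) > self.timeout

detector = TimeoutFailureDetector(timeout_seconds=10)
detector.heartbeat("node-a", now=100.0)
assert not detector.suspected("node-a", now=105.0)  # within the timeout
assert detector.suspected("node-a", now=120.0)      # dead, or just slow?
```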

一旦检测到故障,让系统容忍它也并不容易:机器之间没有全局变量、没有共享内存、没有共同知识(common knowledge),也没有任何其他形式的共享状态。节点甚至无法就现在是几点达成一致,更不用说任何更深刻的事情了。信息从一个节点流向另一个节点的唯一方式是通过不可靠的网络发送。重大决策无法由单个节点安全地做出,因此我们需要能寻求其他节点帮助、并尝试获得法定人数同意的协议。

Once a fault is detected, making a system tolerate it is not easy either: there is no global variable, no shared memory, no common knowledge or any other kind of shared state between the machines. Nodes can’t even agree on what time it is, let alone on anything more profound. The only way information can flow from one node to another is by sending it over the unreliable network. Major decisions cannot be safely made by a single node, so we require protocols that enlist help from other nodes and try to get a quorum to agree.

如果您习惯于在单台计算机的理想化数学完美中编写软件,其中相同的操作总是确定性地返回相同的结果,那么转向分布式系统的混乱物理现实可能会有点令人震惊。相反,如果一个问题可以在一台计算机上解决,分布式系统工程师通常会认为这个问题是微不足道的[ 5 ],事实上,现在一台计算机可以做很多事情[ 95 ]。如果您可以避免打开潘多拉魔盒并简单地将东西保存在一台机器上,那么通常值得这样做。

If you’re used to writing software in the idealized mathematical perfection of a single computer, where the same operation always deterministically returns the same result, then moving to the messy physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems engineers will often regard a problem as trivial if it can be solved on a single computer [5], and indeed a single computer can do a lot nowadays [95]. If you can avoid opening Pandora’s box and simply keep things on a single machine, it is generally worth doing so.

然而,正如第二部分的简介中所讨论的,可扩展性并不是想要使用分布式系统的唯一原因。容错和低延迟(通过将数据放置在靠近用户的地理位置)是同样重要的目标,而这些目标无法通过单个节点来实现。

However, as discussed in the introduction to Part II, scalability is not the only reason for wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically close to users) are equally important goals, and those things cannot be achieved with a single node.

在本章中,我们还稍作引申,探讨了网络、时钟和进程的不可靠性是否是不可避免的自然规律。我们发现并非如此:在网络中提供硬实时响应保证和有界延迟是可能的,但这样做非常昂贵,并且会导致硬件资源利用率较低。大多数非安全关键系统会选择廉价且不可靠,而不是昂贵且可靠。

In this chapter we also went on some tangents to explore whether the unreliability of networks, clocks, and processes is an inevitable law of nature. We saw that it isn’t: it is possible to give hard real-time response guarantees and bounded delays in networks, but doing so is very expensive and results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap and unreliable over expensive and reliable.

我们还谈到了超级计算机,它假定组件可靠,因此当组件出现故障时必须完全停止并重新启动。相比之下,分布式系统可以永远运行,而不会在服务级别中断,因为所有故障和维护都可以在节点级别处理——至少在理论上是这样。(实际上,如果将错误的配置更改推广到所有节点,仍然会使分布式系统崩溃。)

We also touched on supercomputers, which assume reliable components and thus have to be stopped and restarted entirely when a component does fail. By contrast, distributed systems can run forever without being interrupted at the service level, because all faults and maintenance can be handled at the node level—at least in theory. (In practice, if a bad configuration change is rolled out to all nodes, that will still bring a distributed system to its knees.)

本章都是关于问题的,给我们带来了黯淡的前景。在下一章中,我们将继续讨论解决方案,并讨论一些旨在解决分布式系统中所有问题的算法。

This chapter has been all about problems, and has given us a bleak outlook. In the next chapter we will move on to solutions, and discuss some algorithms that have been designed to cope with all the problems in distributed systems.

脚注

i但有一个例外:我们假设故障是非拜占庭式的(参见 “拜占庭式故障”)。

i With one exception: we will assume that faults are non-Byzantine (see “Byzantine Faults”).

ii如果启用了 TCP keepalive,则可能除了偶尔的 keepalive 数据包之外。

ii Except perhaps for an occasional keepalive packet, if TCP keepalive is enabled.

iii 异步传输模式(ATM) 是 20 世纪 80 年代以太网的竞争对手 [ 32 ],但它在电话网络核心交换机之外并未得到广泛采用。尽管共享一个缩写词,但它与自动柜员机(也称为提款机)无关。也许,在某个平行宇宙中,互联网是基于 ATM 之类的东西——在那个宇宙中,互联网视频通话可能比我们的世界可靠得多,因为它们不会遭受数据包丢失和延迟的影响。

iii Asynchronous Transfer Mode (ATM) was a competitor to Ethernet in the 1980s [32], but it didn’t gain much adoption outside of telephone network core switches. It has nothing to do with automatic teller machines (also known as cash machines), despite sharing an acronym. Perhaps, in some parallel universe, the internet is based on something like ATM—in that universe, internet video calls are probably a lot more reliable than they are in ours, because they don’t suffer from dropped and delayed packets.

iv互联网服务提供商之间的对等协议以及通过边界网关协议 (BGP) 建立的路由与 IP 本身相比更类似于电路交换。在此级别,可以购买专用带宽。然而,互联网路由在网络级别运行,而不是主机之间的单独连接,并且时间尺度要长得多。

iv Peering agreements between internet service providers and the establishment of routes through the Border Gateway Protocol (BGP), bear closer resemblance to circuit switching than IP itself. At this level, it is possible to buy dedicated bandwidth. However, internet routing operates at the level of networks, not individual connections between hosts, and at a much longer timescale.

v虽然时钟被称为实时时钟,但它与实时操作系统无关,如“响应时间保证”中所述。

v Although the clock is called real-time, it has nothing to do with real-time operating systems, as discussed in “Response time guarantees”.

vi有分布式序列号生成器,例如 Twitter 的 Snowflake,它们以可扩展的方式生成近似单调递增的唯一 ID(例如,通过将 ID 空间块分配给不同的节点)。然而,它们通常无法保证与因果关系一致的顺序,因为分配 ID 块的时间尺度比数据库读写的时间尺度更长。另请参阅“顺序保证”。

vi There are distributed sequence number generators, such as Twitter’s Snowflake, that generate approximately monotonically increasing unique IDs in a scalable way (e.g., by allocating blocks of the ID space to different nodes). However, they typically cannot guarantee an ordering that is consistent with causality, because the timescale at which blocks of IDs are assigned is longer than the timescale of database reads and writes. See also “Ordering Guarantees”.

参考

[ 1 ] Mark Cavage:“这是无可避免的:你正在构建一个分布式系统”,ACM Queue,第 11 卷,第 4 期,第 80-89 页,2013 年 4 月 。doi:10.1145/2466486.2482856

[1] Mark Cavage: “There’s Just No Getting Around It: You’re Building a Distributed System,” ACM Queue, volume 11, number 4, pages 80-89, April 2013. doi:10.1145/2466486.2482856

[ 2 ] Jay Kreps:“真正了解分布式系统可靠性”,blog.empathybox.com,2012 年 3 月 19 日。

[2] Jay Kreps: “Getting Real About Distributed System Reliability,” blog.empathybox.com, March 19, 2012.

[ 3 ] 悉尼帕多瓦:洛夫莱斯和巴贝奇的惊心动魄的冒险:第一台计算机的(大部分)真实故事。特别书籍,2015 年 4 月。ISBN:978-0-141-98151-2

[3] Sydney Padua: The Thrilling Adventures of Lovelace and Babbage: The (Mostly) True Story of the First Computer. Particular Books, April 2015. ISBN: 978-0-141-98151-2

[ 4 ] Coda Hale:“你不能牺牲分区容错性”,codahale.com,2010 年 10 月 7 日。

[4] Coda Hale: “You Can’t Sacrifice Partition Tolerance,” codahale.com, October 7, 2010.

[ 5 ] Jeff Hodges:“年轻人的分布式系统笔记”,somethingsimilar.com,2013 年 1 月 14 日。

[5] Jeff Hodges: “Notes on Distributed Systems for Young Bloods,” somethingsimilar.com, January 14, 2013.

[ 6 ] Antonio Regalado:“谁创造了‘云计算’?”,technologyreview.com,2011 年 10 月 31 日。

[6] Antonio Regalado: “Who Coined ‘Cloud Computing’?,” technologyreview.com, October 31, 2011.

[ 7 ] Luiz André Barroso、Jimmy Clidaras 和 Urs Hölzle:“作为计算机的数据中心:仓库规模机器设计简介,第二版” ,计算机体系结构综合讲座,第 8 卷,第 3 期,Morgan & Claypool 出版社,2013 年 7 月 。doi:10.2200/S00516ED2V01Y201306CAC024,ISBN: 978-1-627-05010-4

[7] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle: “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition,” Synthesis Lectures on Computer Architecture, volume 8, number 3, Morgan & Claypool Publishers, July 2013. doi:10.2200/S00516ED2V01Y201306CAC024, ISBN: 978-1-627-05010-4

[ 8 ] David Fiala、Frank Mueller、Christian Engelmann 等人:“大规模高性能计算的静默数据损坏检测和纠正”, 高性能计算、网络、存储和分析国际会议(SC12) ,2012 年 11 月。

[8] David Fiala, Frank Mueller, Christian Engelmann, et al.: “Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing,” at International Conference for High Performance Computing, Networking, Storage and Analysis (SC12), November 2012.

[ 9 ] Arjun Singh、Joon Ong、Amit Agarwal 等人:“ Jupiter Rising:Google 数据中心网络中 Clos 拓扑和集中控制的十年”,在 ACM 数据通信特别兴趣小组(SIGCOMM) 年会上, 2015 年 8 月 。doi:10.1145/2785956.2787508

[9] Arjun Singh, Joon Ong, Amit Agarwal, et al.: “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network,” at Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM), August 2015. doi:10.1145/2785956.2787508

[ 10 ] Glenn K. Lockwood:“ Hadoop 在 HPC 中的不适应”,glennklockwood.blogspot.co.uk,2014 年 5 月 16 日。

[10] Glenn K. Lockwood: “Hadoop’s Uncomfortable Fit in HPC,” glennklockwood.blogspot.co.uk, May 16, 2014.

[ 11 ]约翰·冯·诺伊曼:“概率逻辑和从不可靠的成分合成可靠的有机体”,《自动机研究》(AM-34),克劳德·E·香农和约翰·麦卡锡编辑,普林斯顿大学出版社,1956年。ISBN:978- 0-691-07916-5

[11] John von Neumann: “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” in Automata Studies (AM-34), edited by Claude E. Shannon and John McCarthy, Princeton University Press, 1956. ISBN: 978-0-691-07916-5

[ 12 ] 理查德·W·哈明: 科学与工程的艺术。泰勒和弗朗西斯,1997。ISBN:978-9-056-99500-3

[12] Richard W. Hamming: The Art of Doing Science and Engineering. Taylor & Francis, 1997. ISBN: 978-9-056-99500-3

[ 13 ] Claude E. Shannon:“通信的数学理论”,贝尔系统技术杂志,第 27 卷,第 3 期,第 379-423 和 623-656 页,1948 年 7 月。

[13] Claude E. Shannon: “A Mathematical Theory of Communication,” The Bell System Technical Journal, volume 27, number 3, pages 379–423 and 623–656, July 1948.

[ 14 ] Peter Bailis 和 Kyle Kingsbury:“网络是可靠的”, ACM 队列,第 12 卷,第 7 期,第 48-55 页,2014 年 7 月 。doi:10.1145/2639988.2639988

[14] Peter Bailis and Kyle Kingsbury: “The Network Is Reliable,” ACM Queue, volume 12, number 7, pages 48-55, July 2014. doi:10.1145/2639988.2639988

[ 15 ] Joshua B. Leners、Trinabh Gupta、Marcos K. Aguilera 和 Michael Walfish:“借助网络帮助驯服分布式系统中的不确定性”,第 10 届欧洲计算机系统会议(EuroSys),2015 年 4 月 。doi:10.1145 /2741948.2741976

[15] Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, and Michael Walfish: “Taming Uncertainty in Distributed Systems with Help from the Network,” at 10th European Conference on Computer Systems (EuroSys), April 2015. doi:10.1145/2741948.2741976

[ 16 ] Phillipa Gill、Navendu Jain 和 Nachiappan Nagappan:“了解数据中心的网络故障:测量、分析和影响” , ACM SIGCOMM 会议,2011 年 8 月 。doi:10.1145/2018436.2018477

[16] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan: “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications,” at ACM SIGCOMM Conference, August 2011. doi:10.1145/2018436.2018477

[ 17 ]Mark Imbriaco:“上周六停机”, github.com,2012 年 12 月 26 日。

[17] Mark Imbriaco: “Downtime Last Saturday,” github.com, December 26, 2012.

[ 18 ] Will Oremus:“谷歌证实全球互联网正在遭受鲨鱼攻击”,slate.com,2014 年 8 月 15 日。

[18] Will Oremus: “The Global Internet Is Being Attacked by Sharks, Google Confirms,” slate.com, August 15, 2014.

[ 19 ] Marc A. Donges:“回复:bnx2 卡间歇性脱机”,致 Linux netdev邮件列表的消息,spinics.net,2012 年 9 月 13 日。

[19] Marc A. Donges: “Re: bnx2 cards Intermittantly Going Offline,” Message to Linux netdev mailing list, spinics.net, September 13, 2012.

[ 20 ] Kyle Kingsbury:“ Call Me Maybe:Elasticsearch ”,aphyr.com,2014 年 6 月 15 日。

[20] Kyle Kingsbury: “Call Me Maybe: Elasticsearch,” aphyr.com, June 15, 2014.

[ 21 ] Salvatore Sanfilippo:“关于 Redis Sentinel 属性和失败场景的一些争论”,antirez.com,2014 年 10 月 21 日。

[21] Salvatore Sanfilippo: “A Few Arguments About Redis Sentinel Properties and Fail Scenarios,” antirez.com, October 21, 2014.

[ 22 ] Bert Hubert:“终极 SO_LINGER 页面,或者:为什么我的 TCP 不可靠”,blog.netherlabs.nl,2009 年 1 月 18 日。

[22] Bert Hubert: “The Ultimate SO_LINGER Page, or: Why Is My TCP Not Reliable,” blog.netherlabs.nl, January 18, 2009.

[ 23 ] Nicolas Liochon:“ CAP:如果你拥有的只是超时,那么一切看起来都像分区”,blog.thislongrun.com,2015 年 5 月 25 日。

[23] Nicolas Liochon: “CAP: If All You Have Is a Timeout, Everything Looks Like a Partition,” blog.thislongrun.com, May 25, 2015.

[ 24 ] Jerome H. Saltzer、David P. Reed 和 David D. Clark:“系统设计中的端到端论证”,ACM Transactions on Computer Systems,第 2 卷,第 4 期,第 277-288 页,1984 年 11 月.doi :10.1145/357401.357402

[24] Jerome H. Saltzer, David P. Reed, and David D. Clark: “End-To-End Arguments in System Design,” ACM Transactions on Computer Systems, volume 2, number 4, pages 277–288, November 1984. doi:10.1145/357401.357402

[ 25 ] Matthew P. Grosvenor、Malte Schwarzkopf、Ionel Gog 等人:“当您可以跳过队列时,队列就不再重要了!”,第12 届 USENIX 网络系统设计与实现研讨会(NSDI),2015 年 5 月。

[25] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, et al.: “Queues Don’t Matter When You Can JUMP Them!,” at 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015.

[ 26 ]Guohui Wang 和 TS Eugene Ng:“ The Impact of Virtualization on Network Performance of Amazon EC2 Data Center ”,第 29 届 IEEE 国际计算机通信会议(INFOCOM),2010 年 3 月 。doi:10.1109/INFCOM.2010.5461931

[26] Guohui Wang and T. S. Eugene Ng: “The Impact of Virtualization on Network Performance of Amazon EC2 Data Center,” at 29th IEEE International Conference on Computer Communications (INFOCOM), March 2010. doi:10.1109/INFCOM.2010.5461931

[ 27 ] Van Jacobson:“拥塞避免和控制”,ACM 通信架构和协议研讨会(SIGCOMM),1988 年 8 月 。doi:10.1145/52324.52356

[27] Van Jacobson: “Congestion Avoidance and Control,” at ACM Symposium on Communications Architectures and Protocols (SIGCOMM), August 1988. doi:10.1145/52324.52356

[ 28 ] Brandon Philips:“ etcd:分布式锁定和服务发现”,Strange Loop,2014 年 9 月。

[28] Brandon Philips: “etcd: Distributed Locking and Service Discovery,” at Strange Loop, September 2014.

[ 29 ] Steve Newman:“ EC2 I/O 的系统分析”,blog.scalyr.com,2012 年 10 月 16 日。

[29] Steve Newman: “A Systematic Look at EC2 I/O,” blog.scalyr.com, October 16, 2012.

[ 30 ] Naohiro Hayashibara、Xavier Défago、Rami Yared 和 Takuya Katayama:“ Φ 应计故障检测器”,日本高级科学技术研究所信息科学学院,技术报告 IS-RR-2004-010,2004 年 5 月。

[30] Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama: “The ϕ Accrual Failure Detector,” Japan Advanced Institute of Science and Technology, School of Information Science, Technical Report IS-RR-2004-010, May 2004.

[ 31 ] Jeffrey Wang:“ Phi 应计故障检测器”,ternarysearch.blogspot.co.uk,2013 年 8 月 11 日。

[31] Jeffrey Wang: “Phi Accrual Failure Detector,” ternarysearch.blogspot.co.uk, August 11, 2013.

[ 32 ] Srinivasan Keshav:计算机网络的工程方法:ATM 网络、互联网和电话网络。Addison-Wesley Professional,1997 年 5 月。ISBN:978-0-201-63442-6

[32] Srinivasan Keshav: An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network. Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6

[ 33 ] 思科,“集成服务数字网络”,docwiki.cisco.com

[33] Cisco, “Integrated Services Digital Network,” docwiki.cisco.com.

[ 34 ] Othmar Kyas:ATM 网络。国际汤姆森出版社,1995 年。ISBN:978-1-850-32128-6

[34] Othmar Kyas: ATM Networks. International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6

[ 35 ]“ InfiniBand 常见问题解答”,Mellanox Technologies,2014 年 12 月 22 日。

[35] “InfiniBand FAQ,” Mellanox Technologies, December 22, 2014.

[ 36 ] Jose Renato Santos、Yoshio Turner 和 G. (John) Janakiraman:“ End-to-End Congestion Control for InfiniBand ”,IEEE 计算机和通信协会(INFOCOM) 第 22 届年度联合会议,2003 年 4 月。由 HP 实验室帕洛阿尔托出版,技术报告 HPL-2002-359。 doi:10.1109/INFCOM.2003.1208949

[36] Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman: “End-to-End Congestion Control for InfiniBand,” at 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), April 2003. Also published by HP Laboratories Palo Alto, Tech Report HPL-2002-359. doi:10.1109/INFCOM.2003.1208949

[ 37 ] Ulrich Windl、David Dalton、Marc Martinec 和 Dale R. Worley:“ NTP 常见问题解答和 HOWTO ” , ntp.org,2006 年 11 月。

[37] Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley: “The NTP FAQ and HOWTO,” ntp.org, November 2006.

[ 38 ] John Graham-Cumming:“闰秒如何以及为何影响 Cloudflare DNS ”,blog.cloudflare.com,2017 年 1 月 1 日。

[38] John Graham-Cumming: “How and why the leap second affected Cloudflare DNS,” blog.cloudflare.com, January 1, 2017.

[ 39 ] David Holmes:“ Hotspot VM 内部:时钟、定时器和调度事件 – 第 I 部分 – Windows ”,blogs.oracle.com,2006 年 10 月 2 日。

[39] David Holmes: “Inside the Hotspot VM: Clocks, Timers and Scheduling Events – Part I – Windows,” blogs.oracle.com, October 2, 2006.

[ 40 ] Steve Loughran:“多核、多套接字服务器时代”,steveloughran.blogspot.co.uk,2015 年 9 月 17 日。

[40] Steve Loughran: “Time on Multi-Core, Multi-Socket Servers,” steveloughran.blogspot.co.uk, September 17, 2015.

[ 41 ] James C. Corbett、Jeffrey Dean、Michael Epstein 等人:“ Spanner:Google 的全球分布式数据库”,第 10 届 USENIX 操作系统设计与实现(OSDI) 研讨会,2012 年 10 月。

[41] James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: “Spanner: Google’s Globally-Distributed Database,” at 10th USENIX Symposium on Operating System Design and Implementation (OSDI), October 2012.

[ 42 ] M. Caporaloni 和 R. Ambrosini:“个人计算机时钟通过互联网跟踪 UTC 时间刻度的精确程度如何?”,《欧洲物理学杂志》,第 23 卷,第 4 期,L17–L21 页,2012 年 6 月 。doi:10.1088/0143-0807/23/4/103

[42] M. Caporaloni and R. Ambrosini: “How Closely Can a Personal Computer Clock Track the UTC Timescale Via the Internet?,” European Journal of Physics, volume 23, number 4, pages L17–L21, June 2012. doi:10.1088/0143-0807/23/4/103

[ 43 ] Nelson Minar:“ NTP 网络调查”, alumni.media.mit.edu,1999 年 12 月。

[43] Nelson Minar: “A Survey of the NTP Network,” alumni.media.mit.edu, December 1999.

[ 44 ] Viliam Holub:“在 Cassandra 集群中同步时钟 Pt. 1 – 问题”,blog.logentries.com,2014 年 3 月 14 日。

[44] Viliam Holub: “Synchronizing Clocks in a Cassandra Cluster Pt. 1 – The Problem,” blog.logentries.com, March 14, 2014.

[ 45 ] Poul-Henning Kamp:“一秒战争(你会在什么时候死?) ”,ACM Queue,第 9 卷,第 4 期,第 44-48 页,2011 年 4 月 。doi:10.1145/1966989.1967009

[45] Poul-Henning Kamp: “The One-Second War (What Time Will You Die?),” ACM Queue, volume 9, number 4, pages 44–48, April 2011. doi:10.1145/1966989.1967009

[ 46 ] Nelson Minar:“闰秒导致半个互联网崩溃”,somebits.com,2012 年 7 月 3 日。

[46] Nelson Minar: “Leap Second Crashes Half the Internet,” somebits.com, July 3, 2012.

[ 47 ] Christopher Pascoe:“时间、技术和跳秒”,googleblog.blogspot.co.uk,2011 年 9 月 15 日。

[47] Christopher Pascoe: “Time, Technology and Leaping Seconds,” googleblog.blogspot.co.uk, September 15, 2011.

[ 48 ] 赵明学和 Jeff Barr:“三思而后行 – 即将到来的闰秒和 AWS ”,aws.amazon.com,2015 年 5 月 18 日。

[48] Mingxue Zhao and Jeff Barr: “Look Before You Leap – The Coming Leap Second and AWS,” aws.amazon.com, May 18, 2015.

[ 49 ] Darryl Veitch 和 Kanthaiah Vijayalayan:“网络计时和 2015 年闰秒”,第 17 届无源和有源测量国际会议(PAM),2016 年 4 月 。doi:10.1007/978-3-319-30505-9_29

[49] Darryl Veitch and Kanthaiah Vijayalayan: “Network Timing and the 2015 Leap Second,” at 17th International Conference on Passive and Active Measurement (PAM), April 2016. doi:10.1007/978-3-319-30505-9_29

[ 50 ]“ VMware 虚拟机中的计时”,信息指南,VMware, Inc.,2011 年 12 月。

[50] “Timekeeping in VMware Virtual Machines,” Information Guide, VMware, Inc., December 2011.

[ 51 ]“ MiFID II / MiFIR:监管技术和实施标准 - 附件一(草案) ”,欧洲证券和市场管理局,报告 ESMA/2015/1464,2015 年 9 月。

[51] “MiFID II / MiFIR: Regulatory Technical and Implementing Standards – Annex I (Draft),” European Securities and Markets Authority, Report ESMA/2015/1464, September 2015.

[ 52 ] Luke Bigum:“以最低支出解决 MiFID II 时钟同步问题(第 1 部分) ”,lmax.com,2015 年 11 月 27 日。

[52] Luke Bigum: “Solving MiFID II Clock Synchronisation With Minimum Spend (Part 1),” lmax.com, November 27, 2015.

[ 53 ] Kyle Kingsbury:“ Call Me Maybe:Cassandra ”,aphyr.com,2013 年 9 月 24 日。

[53] Kyle Kingsbury: “Call Me Maybe: Cassandra,” aphyr.com, September 24, 2013.

[ 54 ] John Daily:“时钟很糟糕,或者,欢迎来到分布式系统的奇妙世界”,basho.com,2013 年 11 月 12 日。

[54] John Daily: “Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems,” basho.com, November 12, 2013.

[ 55 ] Kyle Kingsbury:“时间戳的麻烦”,aphyr.com,2013 年 10 月 12 日。

[55] Kyle Kingsbury: “The Trouble with Timestamps,” aphyr.com, October 12, 2013.

[ 56 ] Leslie Lamport:“分布式系统中的时间、时钟和事件顺序”,ACM 通讯,第 21 卷,第 7 期,第 558–565 页,1978 年 7 月 。doi:10.1145/359545.359563

[56] Leslie Lamport: “Time, Clocks, and the Ordering of Events in a Distributed System,” Communications of the ACM, volume 21, number 7, pages 558–565, July 1978. doi:10.1145/359545.359563

[ 57 ] Sandeep Kulkarni、Murat Demirbas、Deepak Madeppa 等人:“全球分布式数据库中的逻辑物理时钟和一致快照”,纽约州立大学布法罗分校,计算机科学与工程技术报告 2014-04,2014 年 5 月。

[57] Sandeep Kulkarni, Murat Demirbas, Deepak Madeppa, et al.: “Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases,” State University of New York at Buffalo, Computer Science and Engineering Technical Report 2014-04, May 2014.

[ 58 ] Justin Sheehy:“没有现在:分布式系统中的并发问题”,ACM Queue,第 13 卷,第 3 期,第 36-41 页,2015 年 3 月 。doi:10.1145/2733108

[58] Justin Sheehy: “There Is No Now: Problems With Simultaneity in Distributed Systems,” ACM Queue, volume 13, number 3, pages 36–41, March 2015. doi:10.1145/2733108

[ 59 ] Murat Demirbas:“ Spanner:Google 的全球分布式数据库”,muratbuffalo.blogspot.co.uk,2013 年 7 月 4 日。

[59] Murat Demirbas: “Spanner: Google’s Globally-Distributed Database,” muratbuffalo.blogspot.co.uk, July 4, 2013.

[ 60 ] Dahlia Malkhi 和 Jean-Philippe Martin:“ Spanner 的并发控制”,ACM SIGACT News,第 44 卷,第 3 期,第 73–77 页,2013 年 9 月 。doi:10.1145/2527748.2527767

[60] Dahlia Malkhi and Jean-Philippe Martin: “Spanner’s Concurrency Control,” ACM SIGACT News, volume 44, number 3, pages 73–77, September 2013. doi:10.1145/2527748.2527767

[ 61 ] Manuel Bravo、Nuno Diegues、Jingna Zeng 等人:“ On the Use of Clocks to Enforce Consistency in the Cloud ”,IEEE 数据工程公告,第 38 卷,第 1 期,第 18-31 页,2015 年 3 月。

[61] Manuel Bravo, Nuno Diegues, Jingna Zeng, et al.: “On the Use of Clocks to Enforce Consistency in the Cloud,” IEEE Data Engineering Bulletin, volume 38, number 1, pages 18–31, March 2015.

[ 62 ] Spencer Kimball:“没有原子钟的生活”,cockroachlabs.com,2016 年 2 月 17 日。

[62] Spencer Kimball: “Living Without Atomic Clocks,” cockroachlabs.com, February 17, 2016.

[ 63 ] Cary G. Gray 和 David R. Cheriton:“租约:分布式文件缓存一致性的高效容错机制”, 第 12 届 ACM 操作系统原理研讨会(SOSP),1989 年 12 月 。doi:10.1145/74850.74870

[63] Cary G. Gray and David R. Cheriton: “Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency,” at 12th ACM Symposium on Operating Systems Principles (SOSP), December 1989. doi:10.1145/74850.74870

[ 64 ] Todd Lipcon:“使用 MemStore 本地分配缓冲区避免 Apache HBase 中的完全 GC:第 1 部分” , blog.cloudera.com,2011 年 2 月 24 日。

[64] Todd Lipcon: “Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 1,” blog.cloudera.com, February 24, 2011.

[ 65 ] Martin Thompson:“ Java 垃圾收集蒸馏”,mechanical-sympathy.blogspot.co.uk,2013 年 7 月 16 日。

[65] Martin Thompson: “Java Garbage Collection Distilled,” mechanical-sympathy.blogspot.co.uk, July 16, 2013.

[ 66 ] Alexey Ragozin:“如何抑制 Java GC 暂停?幸存 16GiB 堆及更大堆”,java.dzone.com,2011 年 6 月 28 日。

[66] Alexey Ragozin: “How to Tame Java GC Pauses? Surviving 16GiB Heap and Greater,” java.dzone.com, June 28, 2011.

[ 67 ] Christopher Clark、Keir Fraser、Steven Hand 等人:“虚拟机实时迁移”,第二届 USENIX 网络系统设计与实现(NSDI)研讨会,2005 年 5 月。

[67] Christopher Clark, Keir Fraser, Steven Hand, et al.: “Live Migration of Virtual Machines,” at 2nd USENIX Symposium on Symposium on Networked Systems Design & Implementation (NSDI), May 2005.

[ 68 ] Mike Shaver:“ fsyncers 和 Curveballs ”,shaver.off.net,2008 年 5 月 25 日。

[68] Mike Shaver: “fsyncers and Curveballs,” shaver.off.net, May 25, 2008.

[ 69 ]Zhenyun Zhuang 和 Cuong Tran:“消除后台 IO 流量导致的大型 JVM GC 暂停”,engineering.linkedin.com,2016 年 2 月 10 日。

[69] Zhenyun Zhuang and Cuong Tran: “Eliminating Large JVM GC Pauses Caused by Background IO Traffic,” engineering.linkedin.com, February 10, 2016.

[ 70 ] David Terei 和 Amit Levy:“ Blade:数据中心垃圾收集器”,arXiv:1504.02578,2015 年 4 月 13 日。

[70] David Terei and Amit Levy: “Blade: A Data Center Garbage Collector,” arXiv:1504.02578, April 13, 2015.

[ 71 ] Martin Maas、Tim Harris、Krste Asanović 和 John Kubiatowicz:“垃圾日:协调分布式系统中的垃圾收集”,第 15 届 USENIX 操作系统热门主题研讨会(HotOS),2015 年 5 月。

[71] Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz: “Trash Day: Coordinating Garbage Collection in Distributed Systems,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

[ 72 ]“可预测的低延迟”,Cinnober Financial Technology AB,cinnober.com,2013 年 11 月 24 日。

[72] “Predictable Low Latency,” Cinnober Financial Technology AB, cinnober.com, November 24, 2013.

[ 73 ] Martin Fowler:“ LMAX 架构”, martinfowler.com,2011 年 7 月 12 日。

[73] Martin Fowler: “The LMAX Architecture,” martinfowler.com, July 12, 2011.

[ 74 ] Flavio P. Junqueira 和 Benjamin Reed: ZooKeeper:分布式进程协调。奥莱利媒体,2013 年。ISBN:978-1-449-36130-3

[74] Flavio P. Junqueira and Benjamin Reed: ZooKeeper: Distributed Process Coordination. O’Reilly Media, 2013. ISBN: 978-1-449-36130-3

[ 75 ] Enis Söztutar:“ HBase 和 HDFS:了解 HBase 中的文件系统使用”,HBaseCon,2013 年 6 月。

[75] Enis Söztutar: “HBase and HDFS: Understanding Filesystem Usage in HBase,” at HBaseCon, June 2013.

[ 76 ]Caitie McCaffrey:“客户都是混蛋:又名光环 4 在发布时如何提供服务以及我们如何生存”,caitiem.com,2015 年 6 月 23 日。

[76] Caitie McCaffrey: “Clients Are Jerks: AKA How Halo 4 DoSed the Services at Launch & How We Survived,” caitiem.com, June 23, 2015.

[ 77 ] Leslie Lamport、Robert Shostak 和 Marshall Pease:“拜占庭将军问题”,ACM 编程语言和系统汇刊(TOPLAS),第 4 卷,第 3 期,第 382–401 页,1982 年 7 月 。doi:10.1145/357172.357176

[77] Leslie Lamport, Robert Shostak, and Marshall Pease: “The Byzantine Generals Problem,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 4, number 3, pages 382–401, July 1982. doi:10.1145/357172.357176

[ 78 ] Jim N. Gray:“关于数据库操作系统的注释”, 《操作系统:高级课程》,计算机科学讲义,第 60 卷,由 R. Bayer、RM Graham 和 G. Seegmüller 编辑,第 393 页–481,施普林格出版社,1978 年。ISBN:978-3-540-08755-7

[78] Jim N. Gray: “Notes on Data Base Operating Systems,” in Operating Systems: An Advanced Course, Lecture Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. Seegmüller, pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7

[ 79 ]布莱恩·帕尔默:“拜占庭帝国有多复杂?”,slate.com,2011 年 10 月 20 日。

[79] Brian Palmer: “How Complicated Was the Byzantine Empire?,” slate.com, October 20, 2011.

[ 80 ] Leslie Lamport:“我的著作”,research.microsoft.com,2014 年 12 月 16 日。在网络上搜索将字符串 allla-mport-spubso-ntheweb 中的连字符去掉后得到的 23 个字符的字符串,即可找到此页面。

[80] Leslie Lamport: “My Writings,” research.microsoft.com, December 16, 2014. This page can be found by searching the web for the 23-character string obtained by removing the hyphens from the string allla-mport-spubso-ntheweb.

[ 81 ] John Rushby:“安全关键型嵌入式系统的总线架构”,第一届国际嵌入式软件研讨会 (EMSOFT),2001 年 10 月。

[81] John Rushby: “Bus Architectures for Safety-Critical Embedded Systems,” at 1st International Workshop on Embedded Software (EMSOFT), October 2001.

[ 82 ] Jake Edge:“ ELC:SpaceX 的经验教训”,lwn.net,2013 年 3 月 6 日。

[82] Jake Edge: “ELC: SpaceX Lessons Learned,” lwn.net, March 6, 2013.

[ 83 ] Andrew Miller 和 Joseph J. LaViola, Jr.:“来自中等难度谜题的匿名拜占庭共识:比特币模型”,中佛罗里达大学,技术报告 CS-TR-14-01,2014 年 4 月。

[83] Andrew Miller and Joseph J. LaViola, Jr.: “Anonymous Byzantine Consensus from Moderately-Hard Puzzles: A Model for Bitcoin,” University of Central Florida, Technical Report CS-TR-14-01, April 2014.

[ 84 ] James Mickens:“最悲伤的时刻”,USENIX ;login: logout,2013 年 5 月。

[84] James Mickens: “The Saddest Moment,” USENIX ;login: logout, May 2013.

[ 85 ] Evan Gilman:“ Apache ZooKeeper 毒包的发现”,pagerduty.com,2015 年 5 月 7 日。

[85] Evan Gilman: “The Discovery of Apache ZooKeeper’s Poison Packet,” pagerduty.com, May 7, 2015.

[ 86 ] Jonathan Stone 和 Craig Partridge:“当 CRC 和 TCP 校验和不一致时”,ACM 计算机通信应用程序、技术、架构和协议会议(SIGCOMM),2000 年 8 月 。doi:10.1145/347059.347561

[86] Jonathan Stone and Craig Partridge: “When the CRC and TCP Checksum Disagree,” at ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), August 2000. doi:10.1145/347059.347561

[ 87 ] Evan Jones:“ TCP 和以太网校验和如何失败”,evanjones.ca,2015 年 10 月 5 日。

[87] Evan Jones: “How Both TCP and Ethernet Checksums Fail,” evanjones.ca, October 5, 2015.

[ 88 ] Cynthia Dwork、Nancy Lynch 和 Larry Stockmeyer:“ Consensus in the Presence of Partial Synchrony ”,Journal of the ACM,第 35 卷,第 2 期,第 288–323 页,1988 年 4 月。doi:10.1145/42282.42283

[88] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: “Consensus in the Presence of Partial Synchrony,” Journal of the ACM, volume 35, number 2, pages 288–323, April 1988. doi:10.1145/42282.42283

[ 89 ] Peter Bailis 和 Ali Ghodsi:“当今的最终一致性:限制、扩展及超越” , ACM Queue,第 11 卷,第 3 期,第 55-63 页,2013 年 3 月 。doi:10.1145/2460276.2462076

[89] Peter Bailis and Ali Ghodsi: “Eventual Consistency Today: Limitations, Extensions, and Beyond,” ACM Queue, volume 11, number 3, pages 55-63, March 2013. doi:10.1145/2460276.2462076

[ 90 ] Bowen Alpern 和 Fred B. Schneider:“定义活力”,《 信息处理快报》,第 21 卷,第 4 期,第 181–185 页,1985 年 10 月 。doi:10.1016/0020-0190(85)90056-0

[90] Bowen Alpern and Fred B. Schneider: “Defining Liveness,” Information Processing Letters, volume 21, number 4, pages 181–185, October 1985. doi:10.1016/0020-0190(85)90056-0

[ 91 ] Flavio P. Junqueira:“老兄,我的元数据在哪里?”, fpj.me,2015 年 5 月 28 日。

[91] Flavio P. Junqueira: “Dude, Where’s My Metadata?,” fpj.me, May 28, 2015.

[ 92 ] Scott Sanders:“ 1 月 28 日事件报告”,github.com,2016 年 2 月 3 日。

[92] Scott Sanders: “January 28th Incident Report,” github.com, February 3, 2016.

[ 93 ] Jay Kreps:“关于 Kafka 和 Jepsen 的一些注释”,blog.empathybox.com,2013 年 9 月 25 日。

[93] Jay Kreps: “A Few Notes on Kafka and Jepsen,” blog.empathybox.com, September 25, 2013.

[ 94 ] Thanh Do、Mingzhe Hao、Tanakorn Leesatapornwongsa 等人:“ Limplock:了解 Limpware 对横向扩展云系统的影响”,第 4 届 ACM 云计算研讨会 (SoCC),2013 年 10 月。doi:10.1145/2523616.2523627

[94] Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems,” at 4th ACM Symposium on Cloud Computing (SoCC), October 2013. doi:10.1145/2523616.2523627

[ 95 ] Frank McSherry、Michael Isard 和 Derek G. Murray:“可扩展性!但代价是什么?”,第 15 届 USENIX 操作系统热门话题研讨会(HotOS),2015 年 5 月。

[95] Frank McSherry, Michael Isard, and Derek G. Murray: “Scalability! But at What COST?,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

第 9 章一致性和共识

Chapter 9. Consistency and Consensus

是活着但错了好,还是正确但死了好?

Jay Kreps,《A Few Notes on Kafka and Jepsen》(2013)

Is it better to be alive and wrong or right and dead?

Jay Kreps, A Few Notes on Kafka and Jepsen (2013)

正如第 8 章所讨论的,分布式系统中很多事情都可能出错。处理此类故障的最简单方法就是让整个服务失败,并向用户显示错误消息。如果该解决方案不可接受,我们需要找到容忍故障的方法,即即使某些内部组件出现故障,也保持服务正常运行。

Lots of things can go wrong in distributed systems, as discussed in Chapter 8. The simplest way of handling such faults is to simply let the entire service fail, and show the user an error message. If that solution is unacceptable, we need to find ways of tolerating faults—that is, of keeping the service functioning correctly, even if some internal component is faulty.

在本章中,我们将讨论一些用于构建容错分布式系统的算法和协议的示例。我们假设第 8 章中的所有问题都可能发生:数据包可能在网络中丢失、重新排序、重复或任意延迟;时钟充其量只是近似值;并且节点可以随时暂停(例如,由于垃圾收集)或崩溃。

In this chapter, we will talk about some examples of algorithms and protocols for building fault-tolerant distributed systems. We will assume that all the problems from Chapter 8 can occur: packets can be lost, reordered, duplicated, or arbitrarily delayed in the network; clocks are approximate at best; and nodes can pause (e.g., due to garbage collection) or crash at any time.

构建容错系统的最佳方法是找到一些具有有用保证的通用抽象,实现它们一次,然后让应用程序依赖这些保证。这与我们在第 7 章中使用事务的方法相同:通过使用事务,应用程序可以假装没有崩溃(原子性),没有其他人同时访问数据库(隔离),并且存储设备是完全可靠的(持久性)。即使崩溃、竞争条件和磁盘故障确实发生,事务抽象也会隐藏这些问题,以便应用程序不需要担心它们。

The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees. This is the same approach as we used with transactions in Chapter 7: by using a transaction, the application can pretend that there are no crashes (atomicity), that nobody else is concurrently accessing the database (isolation), and that storage devices are perfectly reliable (durability). Even though crashes, race conditions, and disk failures do occur, the transaction abstraction hides those problems so that the application doesn’t need to worry about them.

现在,我们将继续沿着同样的思路,寻求可以允许应用程序忽略分布式系统的一些问题的抽象。例如,分布式系统最重要的抽象之一是共识:即让所有节点就某件事达成一致。正如我们将在本章中看到的,尽管存在网络故障和流程失败,但可靠地达成共识是一个非常棘手的问题。

We will now continue along the same lines, and seek abstractions that can allow an application to ignore some of the problems with distributed systems. For example, one of the most important abstractions for distributed systems is consensus: that is, getting all of the nodes to agree on something. As we shall see in this chapter, reliably reaching consensus in spite of network faults and process failures is a surprisingly tricky problem.

一旦实现了共识,应用程序就可以将其用于各种目的。例如,假设您有一个具有单主复制的数据库。如果领导者死亡并且需要故障转移到另一个节点,则剩余的数据库节点可以使用共识来选举新的领导者。正如“处理节点中断”中所讨论的,重要的是只有一位领导者,并且所有节点都同意谁是领导者。如果两个节点都认为自己是领导者,这种情况称为脑裂,通常会导致数据丢失。正确实施共识有助于避免此类问题。

Once you have an implementation of consensus, applications can use it for various purposes. For example, say you have a database with single-leader replication. If the leader dies and you need to fail over to another node, the remaining database nodes can use consensus to elect a new leader. As discussed in “Handling Node Outages”, it’s important that there is only one leader, and that all nodes agree who the leader is. If two nodes both believe that they are the leader, that situation is called split brain, and it often leads to data loss. Correct implementations of consensus help avoid such problems.
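To make the split-brain discussion concrete, here is a minimal, hypothetical sketch of leader election built on a linearizable compare-and-set register. The register is simulated in-process with a mutex rather than a real consensus-backed store, and all names (`CasRegister`, `try_become_leader`) are made up for this illustration:

```python
import threading

class CasRegister:
    """A linearizable register, simulated here with a mutex
    (a stand-in for a single authoritative copy of the data)."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

# Each node tries to claim leadership by CAS-ing its own ID into the register.
leader_register = CasRegister(value=None)
results = {}

def try_become_leader(node_id):
    results[node_id] = leader_register.compare_and_set(None, node_id)

threads = [threading.Thread(target=try_become_leader, args=(n,))
           for n in ["node-a", "node-b", "node-c"]]
for t in threads: t.start()
for t in threads: t.join()

# Because the CAS is atomic, exactly one node wins -- no split brain.
winners = [n for n, won in results.items() if won]
print(len(winners))  # 1
```

The essential property is that all nodes observe the same, single outcome of the CAS; a non-linearizable store could let two nodes both believe their CAS succeeded.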

在本章后面的“分布式事务和共识”中,我们将研究 解决共识和相关问题的算法。但首先我们需要探索分布式系统中可以提供的保证和抽象的范围。

Later in this chapter, in “Distributed Transactions and Consensus”, we will look into algorithms to solve consensus and related problems. But first we need to explore the range of guarantees and abstractions that can be provided in a distributed system.

我们需要了解可以做什么和不能做什么的范围:在某些情况下,系统可以容忍故障并继续工作;而在其他情况下,这是不可能的。无论是在理论证明还是在实际实现中,人们都已深入探讨了什么是可能的、什么是不可能的。我们将在本章中概述这些基本限制。

We need to understand the scope of what can and cannot be done: in some situations, it’s possible for the system to tolerate faults and continue working; in other situations, that is not possible. The limits of what is and isn’t possible have been explored in depth, both in theoretical proofs and in practical implementations. We will get an overview of those fundamental limits in this chapter.

分布式系统领域的研究人员几十年来一直在研究这些主题,因此有很多材料——我们只能触及表面。在本书中,我们没有篇幅详细介绍正式模型和证明的细节,因此我们将坚持非正式的直觉。如果您有兴趣,文献参考可以提供大量额外的深度。

Researchers in the field of distributed systems have been studying these topics for decades, so there is a lot of material—we’ll only be able to scratch the surface. In this book we don’t have space to go into details of the formal models and proofs, so we will stick with informal intuitions. The literature references offer plenty of additional depth if you’re interested.

一致性保证

Consistency Guarantees

在“复制滞后问题”中,我们研究了复制数据库中出现的一些时序问题。如果您在同一时刻查看两个数据库节点,您很可能会在两个节点上看到不同的数据,因为写请求在不同时间到达不同的节点。无论数据库使用哪种复制方法(单领导者、多领导者或无领导者复制),这些不一致都会发生。

In “Problems with Replication Lag” we looked at some timing issues that occur in a replicated database. If you look at two database nodes at the same moment in time, you’re likely to see different data on the two nodes, because write requests arrive on different nodes at different times. These inconsistencies occur no matter what replication method the database uses (single-leader, multi-leader, or leaderless replication).

大多数复制数据库至少提供最终一致性,这意味着如果您停止写入数据库并等待一段未指定的时间,那么最终所有读取请求将返回相同的值[ 1 ]。换句话说,不一致是暂时的,最终会自行解决(假设网络中的任何故障最终也会得到修复)。最终一致性的一个更好的名称可能是 收敛,因为我们期望所有副本最终收敛到相同的值[ 2 ]。

Most replicated databases provide at least eventual consistency, which means that if you stop writing to the database and wait for some unspecified length of time, then eventually all read requests will return the same value [1]. In other words, the inconsistency is temporary, and it eventually resolves itself (assuming that any faults in the network are also eventually repaired). A better name for eventual consistency may be convergence, as we expect all replicas to eventually converge to the same value [2].

然而,这是一个非常弱的保证——它没有说明副本何时收敛。在收敛之前,读取可能返回任何内容,也可能不返回任何内容[ 1 ]。例如,如果您写入一个值,然后立即再次读取它,则不能保证您会看到刚刚写入的值,因为读取可能会路由到不同的副本(请参阅“读取您自己的写入”)。

However, this is a very weak guarantee—it doesn’t say anything about when the replicas will converge. Until the time of convergence, reads could return anything or nothing [1]. For example, if you write a value and then immediately read it again, there is no guarantee that you will see the value you just wrote, because the read may be routed to a different replica (see “Reading Your Own Writes”).
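The write-then-stale-read scenario can be sketched as a toy single-leader setup in which replication to the follower is an explicit, asynchronous step. This is a deliberately simplified in-process model for illustration, not a real replication protocol:

```python
class Replica:
    def __init__(self):
        self.data = {}

# Single-leader setup: writes go to the leader; replication to the
# follower is asynchronous, modeled here as an explicit replicate() step.
leader, follower = Replica(), Replica()
replication_log = []

def write(key, value):
    leader.data[key] = value
    replication_log.append((key, value))  # shipped to the follower later

def read_from(replica, key):
    return replica.data.get(key)

def replicate():
    while replication_log:
        key, value = replication_log.pop(0)
        follower.data[key] = value

write("x", 1)
print(read_from(leader, "x"))    # 1 -- reading your own write on the leader
print(read_from(follower, "x"))  # None -- a read routed to the lagging follower

replicate()                       # ...but the replicas eventually converge
print(read_from(follower, "x"))  # 1
```

Eventual consistency only promises the final state after `replicate()` runs; it says nothing about what the middle read returns.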

最终一致性对于应用程序开发人员来说很难,因为它与普通单线程程序中变量的行为非常不同。如果您为变量分配一个值,然后不久之后读取它,则您不会期望读回旧值,或者读取会失败。数据库表面上看起来像一个可以读写的变量,但实际上它具有复杂得多的语义[ 3 ]。

Eventual consistency is hard for application developers because it is so different from the behavior of variables in a normal single-threaded program. If you assign a value to a variable and then read it shortly afterward, you don’t expect to read back the old value, or for the read to fail. A database looks superficially like a variable that you can read and write, but in fact it has much more complicated semantics [3].

当使用仅提供弱保证的数据库时,您需要不断意识到其局限性,并且不要意外地假设太多。错误通常很微妙,很难通过测试发现,因为应用程序在大多数情况下可能运行良好。最终一致性的边缘情况只有在系统出现故障(例如网络中断)或高并发时才变得明显。

When working with a database that provides only weak guarantees, you need to be constantly aware of its limitations and not accidentally assume too much. Bugs are often subtle and hard to find by testing, because the application may work well most of the time. The edge cases of eventual consistency only become apparent when there is a fault in the system (e.g., a network interruption) or at high concurrency.

在本章中,我们将探讨数据系统可能选择提供的更强的一致性模型。它们不是免费的:具有较强保证的系统可能比具有较弱保证的系统具有更差的性能或容错能力。然而,更强的保证可能很有吸引力,因为它们更容易正确使用。一旦您看到了几种不同的一致性模型,您将能够更好地决定哪一种最适合您的需求。

In this chapter we will explore stronger consistency models that data systems may choose to provide. They don’t come for free: systems with stronger guarantees may have worse performance or be less fault-tolerant than systems with weaker guarantees. Nevertheless, stronger guarantees can be appealing because they are easier to use correctly. Once you have seen a few different consistency models, you’ll be in a better position to decide which one best fits your needs.

分布式一致性模型和我们之前讨论的事务隔离级别层次结构之间有一些相似之处 [ 4 , 5 ](请参阅“弱隔离级别”)。虽然存在一些重叠,但它们大多是独立的问题:事务隔离主要是为了避免由于并发执行事务而导致的竞争条件,而分布式一致性主要是为了在面对延迟和故障时协调副本的状态。

There is some similarity between distributed consistency models and the hierarchy of transaction isolation levels we discussed previously [4, 5] (see “Weak Isolation Levels”). But while there is some overlap, they are mostly independent concerns: transaction isolation is primarily about avoiding race conditions due to concurrently executing transactions, whereas distributed consistency is mostly about coordinating the state of replicas in the face of delays and faults.

本章涵盖了广泛的主题,但正如我们将看到的,这些领域实际上是紧密相连的:

This chapter covers a broad range of topics, but as we shall see, these areas are in fact deeply linked:

  • 我们将首先研究常用的最强一致性模型之一: 线性化,并检查其优缺点。

  • We will start by looking at one of the strongest consistency models in common use, linearizability, and examine its pros and cons.

  • 然后,我们将研究分布式系统中事件的排序问题(“排序保证”),特别是围绕因果关系和总排序。

  • We’ll then examine the issue of ordering events in a distributed system (“Ordering Guarantees”), particularly around causality and total ordering.

  • 在第三部分(“分布式事务和共识”)中,我们将探讨如何原子地提交分布式事务,这将最终引导我们找到共识问题的解决方案。

  • In the third section (“Distributed Transactions and Consensus”) we will explore how to atomically commit a distributed transaction, which will finally lead us toward solutions for the consensus problem.

线性化

Linearizability

在最终一致的数据库中,如果您同时向两个不同的副本询问同一问题,您可能会得到两个不同的答案。这很令人困惑。如果数据库能够给人一种只有一个副本(即只有一份数据副本)的错觉,那不是简单很多吗?然后每个客户端都会有相同的数据视图,并且您不必担心复制滞后。

In an eventually consistent database, if you ask two different replicas the same question at the same time, you may get two different answers. That’s confusing. Wouldn’t it be a lot simpler if the database could give the illusion that there is only one replica (i.e., only one copy of the data)? Then every client would have the same view of the data, and you wouldn’t have to worry about replication lag.

这就是线性化 [ 6 ](也称为原子一致性 [ 7 ]、 强一致性立即一致性外部一致性 [ 8 ])背后的想法。线性化的确切定义非常微妙,我们将在本节的其余部分中对其进行探讨。但基本思想是让系统看起来好像只有一份数据副本,并且对其的所有操作都是原子的。有了这个保证,即使现实中可能存在多个副本,应用程序也不需要担心它们。

This is the idea behind linearizability [6] (also known as atomic consistency [7], strong consistency, immediate consistency, or external consistency [8]). The exact definition of linearizability is quite subtle, and we will explore it in the rest of this section. But the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.

在线性化系统中,一旦一个客户端成功完成写入,所有从数据库读取的客户端都必须能够看到刚刚写入的值。维持数据只有单一副本的假象,意味着保证读取到的是最近的、最新的值,而不是来自过时的缓存或副本。换句话说,线性化是一种新近性保证。为了阐明这个想法,让我们看一个不可线性化系统的示例。

In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. Maintaining the illusion of a single copy of the data means guaranteeing that the value read is the most recent, up-to-date value, and doesn’t come from a stale cache or replica. In other words, linearizability is a recency guarantee. To clarify this idea, let’s look at an example of a system that is not linearizable.

图 9-1。该系统不可线性化,导致球迷感到困惑。

图 9-1显示了一个非线性体育网站的示例 [ 9 ]。Alice 和 Bob 坐在同一个房间,两人都在查看手机,查看 2014 年 FIFA 世界杯决赛的结果。最终比分公布后,Alice 刷新页面,看到获胜者公布,兴奋地告诉 Bob。鲍勃难以置信地在自己的手机上点击了重新加载,但他的请求发送到了一个滞后的数据库副本,因此他的手机显示游戏仍在进行中。

Figure 9-1 shows an example of a nonlinearizable sports website [9]. Alice and Bob are sitting in the same room, both checking their phones to see the outcome of the 2014 FIFA World Cup final. Just after the final score is announced, Alice refreshes the page, sees the winner announced, and excitedly tells Bob about it. Bob incredulously hits reload on his own phone, but his request goes to a database replica that is lagging, and so his phone shows that the game is still ongoing.

如果 Alice 和 Bob 同时点击重新加载,那么他们得到两个不同的查询结果也就不足为奇了,因为他们不知道服务器在什么时间处理他们各自的请求。然而,Bob 知道,在听到 Alice 喊出最终分数后,他按下了重新加载按钮(启动了查询) ,因此他预计他的查询结果至少与 Alice 的查询结果一样新。他的查询返回了过时的结果,这一事实违反了线性化。

If Alice and Bob had hit reload at the same time, it would have been less surprising if they had gotten two different query results, because they wouldn’t know at exactly what time their respective requests were processed by the server. However, Bob knows that he hit the reload button (initiated his query) after he heard Alice exclaim the final score, and therefore he expects his query result to be at least as recent as Alice’s. The fact that his query returned a stale result is a violation of linearizability.

是什么使系统可线性化?

What Makes a System Linearizable?

线性化背后的基本思想很简单:让系统看起来好像只有一个数据副本。然而,准确地确定这意味着什么实际上需要一些小心。为了更好地理解线性化,让我们看一些更多的例子。

The basic idea behind linearizability is simple: to make a system appear as if there is only a single copy of the data. However, nailing down precisely what that means actually requires some care. In order to understand linearizability better, let’s look at some more examples.

图 9-2显示了三个客户端同时在线性化数据库中读取和写入相同的键x 。在分布式系统文献中,x被称为 寄存器——实际上,它可以是键值存储中的一个键、关系数据库中的一行或文档数据库中的一个文档。

Figure 9-2 shows three clients concurrently reading and writing the same key x in a linearizable database. In the distributed systems literature, x is called a register—in practice, it could be one key in a key-value store, one row in a relational database, or one document in a document database, for example.

图 9-2。如果读取请求与写入请求并发,则它可能返回旧值或新值。

为了简单起见,图 9-2仅显示了客户端角度的请求,而不显示数据库的内部结构。每个条形都是客户端发出的请求,其中条形的开始是发送请求的时间,条形的结束是客户端收到响应的时间。由于可变的网络延迟,客户端并不确切知道数据库何时处理其请求,它只知道它一定是在客户端发送请求和接收响应之间的某个时间发生的。

For simplicity, Figure 9-2 shows only the requests from the clients’ point of view, not the internals of the database. Each bar is a request made by a client, where the start of a bar is the time when the request was sent, and the end of a bar is when the response was received by the client. Due to variable network delays, a client doesn’t know exactly when the database processed its request—it only knows that it must have happened sometime between the client sending the request and receiving the response.i

在这个例子中,寄存器有两种类型的操作:

In this example, the register has two types of operations:

  • read ( x )⇒v 表示客户端请求读取寄存器x的值 ,数据库返回值v

  • read(x) ⇒ v means the client requested to read the value of register x, and the database returned the value v.

  • write ( xv ) ⇒  r表示客户端请求将寄存器x设置为值v,并且数据库返回响应r(可能是okerror)。

  • write(xv) ⇒ r means the client requested to set the register x to value v, and the database returned response r (which could be ok or error).

图9-2中, x的值最初为0,客户端C执行写入请求将其设置为1。此时,客户端A和B不断轮询数据库以读取最新值。A 和 B 的读取请求可能得到哪些响应?

In Figure 9-2, the value of x is initially 0, and client C performs a write request to set it to 1. While this is happening, clients A and B are repeatedly polling the database to read the latest value. What are the possible responses that A and B might get for their read requests?

  • 客户端 A 的第一次读操作在写操作开始之前完成,因此它肯定会返回旧值 0。

  • The first read operation by client A completes before the write begins, so it must definitely return the old value 0.

  • 客户端 A 的最后一次读取是在写入完成后开始的,因此如果数据库是可线性化的,它肯定会返回新值 1:我们知道写入操作必须在写入操作开始和结束之间的某个时间被处理,并且必须在读取操作开始和结束之间的某个时间处理读取。如果读取在写入结束后开始,则读取一定是在写入之后处理的,因此它必须看到写入的新值。

  • The last read by client A begins after the write has completed, so it must definitely return the new value 1 if the database is linearizable: we know that the write must have been processed sometime between the start and end of the write operation, and the read must have been processed sometime between the start and end of the read operation. If the read started after the write ended, then the read must have been processed after the write, and therefore it must see the new value that was written.

  • 任何与写操作时间重叠的读操作都可能返回 0 或 1,因为我们不知道在处理读操作时写操作是否已生效。这些操作与写入是并发的。

  • Any read operations that overlap in time with the write operation might return either 0 or 1, because we don’t know whether or not the write has taken effect at the time when the read operation is processed. These operations are concurrent with the write.

然而,这还不足以完全描述线性化:如果与写入并发的读取可以返回旧值或新值,那么在写入操作进行期间,读者可能会看到值在旧值和新值之间来回翻转多次。这不是我们对模拟“数据的单一副本”的系统的期望。ii

However, that is not yet sufficient to fully describe linearizability: if reads that are concurrent with a write can return either the old or the new value, then readers could see a value flip back and forth between the old and the new value several times while a write is going on. That is not what we expect of a system that emulates a “single copy of the data.”ii

为了使系统可线性化,我们需要添加另一个约束,如图 9-3所示。

To make the system linearizable, we need to add another constraint, illustrated in Figure 9-3.

图 9-3。在任何一次读取返回新值后,所有后续读取(在同一客户端或其他客户端上)也必须返回新值。

在线性化系统中,我们想象必须存在某个时间点(在写入操作的开始和结束之间),此时x的值自动从 0 翻转到 1。因此,如果一个客户端的读取返回新值 1,所有后续读取也必须返回新值,即使写入操作尚未完成。

In a linearizable system we imagine that there must be some point in time (between the start and end of the write operation) at which the value of x atomically flips from 0 to 1. Thus, if one client’s read returns the new value 1, all subsequent reads must also return the new value, even if the write operation has not yet completed.

图 9-3中的箭头说明了这种时序依赖性。客户端 A 是第一个读取新值 1 的人。在 A 的读取返回后,B 开始新的读取。由于 B 的读取严格发生在 A 的读取之后,因此它也必须返回 1,即使 C 的写入仍在进行中。(这与图 9-1中 Alice 和 Bob 的情况相同 :Alice 读取新值后,Bob 也期望读取新值。)

This timing dependency is illustrated with an arrow in Figure 9-3. Client A is the first to read the new value, 1. Just after A’s read returns, B begins a new read. Since B’s read occurs strictly after A’s read, it must also return 1, even though the write by C is still ongoing. (It’s the same situation as with Alice and Bob in Figure 9-1: after Alice has read the new value, Bob also expects to read the new value.)

我们可以进一步细化这个时序图,以可视化在某个时间点以原子方式生效的每个操作。图 9-4 [ 10 ]显示了一个更复杂的示例。

We can further refine this timing diagram to visualize each operation taking effect atomically at some point in time. A more complex example is shown in Figure 9-4 [10].

图9-4, 我们添加了除了读和写之外的第三种操作:

In Figure 9-4 we add a third type of operation besides read and write:

  • cas ( xv oldv new ) ⇒  r表示客户端请求原子比较和设置操作(请参阅“比较和设置”)。如果寄存器x的当前值等于v old,则应自动将其设置为v new。如果 x  ≠  v old,则该操作应保持寄存器不变并返回错误。r是数据库的响应(okerror)。

  • cas(xvoldvnew) ⇒ r means the client requested an atomic compare-and-set operation (see “Compare-and-set”). If the current value of the register x equals vold, it should be atomically set to vnew. If x ≠ vold then the operation should leave the register unchanged and return an error. r is the database’s response (ok or error).

图 9-4中的每个操作都用一条垂直线(在每个操作的条形内部)标记了我们认为该操作被执行的时间。这些标记按顺序连接在一起,结果必须是寄存器的有效读写序列(每次读取都必须返回最近写入设置的值)。

Each operation in Figure 9-4 is marked with a vertical line (inside the bar for each operation) at the time when we think the operation was executed. Those markers are joined up in a sequential order, and the result must be a valid sequence of reads and writes for a register (every read must return the value set by the most recent write).

线性化的要求是连接操作标记的线总是在时间上向前移动(从左到右),而不是向后移动。此要求确保了我们之前讨论的新近度保证:一旦写入或读取了新值,所有后续读取都会看到写入的值,直到它再次被覆盖。

The requirement of linearizability is that the lines joining up the operation markers always move forward in time (from left to right), never backward. This requirement ensures the recency guarantee we discussed earlier: once a new value has been written or read, all subsequent reads see the value that was written, until it is overwritten again.

图 9-4。可视化读取和写入似乎已生效的时间点。B 的最终读取不可线性化。

图 9-4中有一些有趣的细节需要指出:

There are a few interesting details to point out in Figure 9-4:

  • 首先客户端 B 发送了读取x的请求,然后客户端 D 发送了将x设置为 0 的请求,然后客户端 A 发送了将x设置为 1 的请求。然而,B 的读取返回的值是 1(由 A 写入的值)。这是可以的:这意味着数据库首先处理 D 的写入,然后处理 A 的写入,最后处理 B 的读取。尽管这不是请求发送的顺序,但这是一个可接受的顺序,因为这三个请求是并发的。也许 B 的读请求在网络中略有延迟,所以在两次写入之后才到达数据库。

  • First client B sent a request to read x, then client D sent a request to set x to 0, and then client A sent a request to set x to 1. Nevertheless, the value returned to B’s read is 1 (the value written by A). This is okay: it means that the database first processed D’s write, then A’s write, and finally B’s read. Although this is not the order in which the requests were sent, it’s an acceptable order, because the three requests are concurrent. Perhaps B’s read request was slightly delayed in the network, so it only reached the database after the two writes.

  • 在客户端 A 收到数据库的响应之前,客户端 B 的读取返回了 1,表示值 1 的写入成功。这也没关系:这并不意味着该值在写入之前被读取,它只是意味着从数据库到客户端 A 的ok响应在网络中稍微延迟了。

  • Client B’s read returned 1 before client A received its response from the database, saying that the write of the value 1 was successful. This is also okay: it doesn’t mean the value was read before it was written, it just means the ok response from the database to client A was slightly delayed in the network.

  • 该模型不假设任何事务隔离:另一个客户端可能随时更改值。例如,C先读取1,然后读取2,因为在两次读取之间该值被B改变了。原子比较和设置 ( cas ) 操作可用于检查该值是否未被另一个客户端同时更改:B 和 C 的cas请求成功,但 D 的cas 请求失败(当数据库处理它时,x的值不再是 0)。

  • This model doesn’t assume any transaction isolation: another client may change a value at any time. For example, C first reads 1 and then reads 2, because the value was changed by B between the two reads. An atomic compare-and-set (cas) operation can be used to check the value hasn’t been concurrently changed by another client: B and C’s cas requests succeed, but D’s cas request fails (by the time the database processes it, the value of x is no longer 0).

  • 客户端 B 的最终读取(在阴影栏中)不可线性化。该操作与C的cas write并发,将x从2更新为4。在没有其他请求的情况下,B的读取返回2是可以的。但是,在B的读取开始之前,客户端A已经读取了新值4 ,因此不允许 B 读取比 A 更旧的值。同样,这与图 9-1中的 Alice 和 Bob 的情况相同。

  • The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with C’s cas write, which updates x from 2 to 4. In the absence of other requests, it would be okay for B’s read to return 2. However, client A has already read the new value 4 before B’s read started, so B is not allowed to read an older value than A. Again, it’s the same situation as with Alice and Bob in Figure 9-1.

这就是线性化背后的直觉;正式定义[ 6 ]更准确地描述了它。可以通过记录所有请求和响应的时间并检查它们是否可以排列成有效的顺序来测试系统的行为是否可线性化(尽管计算成本较高)[11 ]

That is the intuition behind linearizability; the formal definition [6] describes it more precisely. It is possible (though computationally expensive) to test whether a system’s behavior is linearizable by recording the timings of all requests and responses, and checking whether they can be arranged into a valid sequential order [11].
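The test mentioned above—recording request timings and searching for a valid sequential order—can be sketched as a brute-force checker for a single register. This is a toy illustration of the idea (exponential in the history length), not an implementation of the algorithm in [11], and the history encoding is invented for this example:

```python
from itertools import permutations

def is_linearizable(history, initial=0):
    """Does some sequential order of the operations respect both
    real-time ordering (op1 before op2 whenever op1 ended before op2
    started) and register semantics (each read returns the most
    recently written value)? Brute force, fine for tiny histories."""
    for order in permutations(history):
        # Real-time constraint: b may not come after a if b ended before a began.
        if any(b["end"] < a["start"]
               for i, a in enumerate(order) for b in order[i+1:]):
            continue
        value, valid = initial, True
        for op in order:
            if op["op"] == "write":
                value = op["arg"]
            elif op["op"] == "read" and op["result"] != value:
                valid = False
                break
        if valid:
            return True
    return False

# Figure 9-2-style history: reads overlapping the write may return 0 or 1,
# but once some read has returned 1, later reads must not return 0.
ok_history = [
    {"op": "write", "arg": 1, "start": 1, "end": 5, "result": "ok"},
    {"op": "read",  "start": 2, "end": 3, "result": 1},
    {"op": "read",  "start": 4, "end": 6, "result": 1},
]
bad_history = [
    {"op": "write", "arg": 1, "start": 1, "end": 5, "result": "ok"},
    {"op": "read",  "start": 2, "end": 3, "result": 1},
    {"op": "read",  "start": 4, "end": 6, "result": 0},  # stale read after a fresh one
]
print(is_linearizable(ok_history))   # True
print(is_linearizable(bad_history))  # False
```

The `bad_history` case is exactly the Alice-and-Bob anomaly: the second read started after the first one finished, so no permitted ordering can explain its stale result.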

依靠线性化

Relying on Linearizability

线性化在什么情况下有用?查看体育比赛的最终比分可能是一个无聊的例子:在这种情况下,过时几秒钟的结果不太可能造成任何真正的伤害。然而,在某些领域,线性化是使系统正常工作的重要要求。

In what circumstances is linearizability useful? Viewing the final score of a sporting match is perhaps a frivolous example: a result that is outdated by a few seconds is unlikely to cause any real harm in this situation. However, there are a few areas in which linearizability is an important requirement for making a system work correctly.

锁定和领导者选举

Locking and leader election

使用单领导者复制的系统需要确保确实只有一个领导者,而不是多个(脑裂)。选举领导者的一种方法是使用锁:每个启动的节点都尝试获取锁,成功的节点将成为领导者[ 14 ]。无论这个锁如何实现,它都必须是线性化的:所有节点必须同意哪个节点拥有该锁;否则是没有用的。

A system that uses single-leader replication needs to ensure that there is indeed only one leader, not several (split brain). One way of electing a leader is to use a lock: every node that starts up tries to acquire the lock, and the one that succeeds becomes the leader [14]. No matter how this lock is implemented, it must be linearizable: all nodes must agree which node owns the lock; otherwise it is useless.

Apache ZooKeeper [ 15 ] 和 etcd [ 16 ] 等协调服务通常用于实现分布式锁和领导者选举。它们使用共识算法以容错方式实现线性化操作(我们将在本章后面的“容错共识”中讨论此类算法)。iii 正确实现锁和领导者选举仍然有许多微妙的细节(例如,参见“领导者和锁”中的围栏问题),像 Apache Curator [ 17 ] 这样的库通过在 ZooKeeper 之上提供更高级别的配方来提供帮助。然而,线性化存储服务是这些协调任务的基础。

Coordination services like Apache ZooKeeper [15] and etcd [16] are often used to implement distributed locks and leader election. They use consensus algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms later in this chapter, in “Fault-Tolerant Consensus”).iii There are still many subtle details to implementing locks and leader election correctly (see for example the fencing issue in “The leader and the lock”), and libraries like Apache Curator [17] help by providing higher-level recipes on top of ZooKeeper. However, a linearizable storage service is the basic foundation for these coordination tasks.

分布式锁定还在一些分布式数据库中以更细粒度的级别使用,例如 Oracle Real Application Clusters (RAC) [ 18 ]。RAC 对每个磁盘页使用一个锁,多个节点共享对同一磁盘存储系统的访问。由于这些线性化锁位于事务执行的关键路径上,因此 RAC 部署通常具有专用的集群互连网络,用于数据库节点之间的通信。

Distributed locking is also used at a much more granular level in some distributed databases, such as Oracle Real Application Clusters (RAC) [18]. RAC uses a lock per disk page, with multiple nodes sharing access to the same disk storage system. Since these linearizable locks are on the critical path of transaction execution, RAC deployments usually have a dedicated cluster interconnect network for communication between database nodes.

约束和唯一性保证

Constraints and uniqueness guarantees

唯一性约束在数据库中很常见:例如,用户名或电子邮件地址必须唯一标识一个用户,并且在文件存储服务中不能存在两个具有相同路径和文件名的文件。如果您想在写入数据时强制执行此约束(例如,如果两个人尝试同时创建同名的用户或文件,其中一个将返回错误),则需要线性化。

Uniqueness constraints are common in databases: for example, a username or email address must uniquely identify one user, and in a file storage service there cannot be two files with the same path and filename. If you want to enforce this constraint as the data is written (such that if two people try to concurrently create a user or a file with the same name, one of them will be returned an error), you need linearizability.

这种情况实际上类似于锁:当用户注册您的服务时,您可以认为他们获得了对其所选用户名的“锁”。该操作也非常类似于原子比较和设置,将用户名设置为声明该用户名的用户的 ID,前提是该用户名尚未被占用。

This situation is actually similar to a lock: when a user registers for your service, you can think of them acquiring a “lock” on their chosen username. The operation is also very similar to an atomic compare-and-set, setting the username to the ID of the user who claimed it, provided that the username is not already taken.
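The username-as-lock analogy can be sketched as follows: claiming a username is a compare-and-set against an "absent" value, performed here against a single authoritative table guarded by a mutex. The names and structure are illustrative, not the API of any real system:

```python
import threading

usernames = {}           # username -> user ID: the single authoritative copy
table_lock = threading.Lock()

def claim_username(name, user_id):
    """Atomically claim a username; succeeds only if it is not taken.
    This is the compare-and-set pattern: set name -> user_id provided
    the current value is 'absent'."""
    with table_lock:
        if name in usernames:
            return False
        usernames[name] = user_id
        return True

outcome = {}
def register(user_id):
    outcome[user_id] = claim_username("alice", user_id)

threads = [threading.Thread(target=register, args=(u,)) for u in (1, 2)]
for t in threads: t.start()
for t in threads: t.join()

print(sorted(outcome.values()))  # [False, True]: exactly one registration wins
```

With only eventually consistent replicas and no single authoritative copy, both registrations could appear to succeed, violating the uniqueness constraint.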

如果您想确保银行账户余额永远不会出现负数,或者您出售的商品不超过仓库的库存,或者两个人不会同时预订航班或剧院中的同一个座位,也会出现类似的问题。这些约束都要求有一个所有节点都同意的单一最新值(账户余额、库存水平、座位占用情况)。

Similar issues arise if you want to ensure that a bank account balance never goes negative, or that you don’t sell more items than you have in stock in the warehouse, or that two people don’t concurrently book the same seat on a flight or in a theater. These constraints all require there to be a single up-to-date value (the account balance, the stock level, the seat occupancy) that all nodes agree on.

在实际应用中,有时宽松地对待此类约束是可以接受的(例如,如果航班超额预订,您可以将客户转移到其他航班并为他们带来的不便提供补偿)。在这种情况下,可能不需要线性化,我们将在“及时性和完整性”中讨论这种松散解释的约束。

In real applications, it is sometimes acceptable to treat such constraints loosely (for example, if a flight is overbooked, you can move customers to a different flight and offer them compensation for the inconvenience). In such cases, linearizability may not be needed, and we will discuss such loosely interpreted constraints in “Timeliness and Integrity”.

然而,硬唯一性约束(例如关系数据库中常见的约束)需要线性化。其他类型的约束,例如外键或属性约束,可以在不需要线性化的情况下实现[ 19 ]。

However, a hard uniqueness constraint, such as the one you typically find in relational databases, requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints, can be implemented without requiring linearizability [19].

跨通道时序依赖性

Cross-channel timing dependencies

请注意图 9-1 中的一个细节:如果 Alice 没有喊出分数,Bob 就不会知道他的查询结果已经过时。几秒后他再次刷新页面,最终就看到了最终的成绩。线性化违规只是因为系统中存在额外的通信通道(爱丽丝的声音传到鲍勃的耳朵)而被注意到。

Notice a detail in Figure 9-1: if Alice hadn’t exclaimed the score, Bob wouldn’t have known that the result of his query was stale. He would have just refreshed the page again a few seconds later, and eventually seen the final score. The linearizability violation was only noticed because there was an additional communication channel in the system (Alice’s voice to Bob’s ears).

类似的情况也可能出现在计算机系统中。例如,假设您有一个网站,用户可以在其中上传照片,并且后台进程会将照片大小调整为较低的分辨率,以便更快地下载(缩略图)。该系统的架构和数据流如图 9-5所示。

Similar situations can arise in computer systems. For example, say you have a website where users can upload a photo, and a background process resizes the photos to lower resolution for faster download (thumbnails). The architecture and dataflow of this system is illustrated in Figure 9-5.

需要明确指示图像缩放器执行缩放作业,并且该指令通过消息队列从Web服务器发送到缩放器(参见第11章)。Web 服务器不会将整个照片放入队列中,因为大多数消息代理都是为小消息而设计的,并且一张照片的大小可能有几兆字节。相反,照片首先被写入文件存储服务,一旦写入完成,调整大小的指令就会被放入队列中。

The image resizer needs to be explicitly instructed to perform a resizing job, and this instruction is sent from the web server to the resizer via a message queue (see Chapter 11). The web server doesn’t place the entire photo on the queue, since most message brokers are designed for small messages, and a photo may be several megabytes in size. Instead, the photo is first written to a file storage service, and once the write is complete, the instruction to the resizer is placed on the queue.
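
The ordering described above (write the blob first, only then enqueue the small instruction) might be sketched as follows. This is a simplified single-process illustration: a dict and a `queue.Queue` stand in for the file storage service and the message broker, and the message format is invented:

```python
import queue
import uuid

file_storage = {}            # stand-in for the file storage service
resize_jobs = queue.Queue()  # stand-in for the message broker

def handle_upload(photo_bytes):
    # Step 1: write the full-size photo to file storage first...
    photo_id = str(uuid.uuid4())
    file_storage[photo_id] = photo_bytes
    # Step 2: ...and only then enqueue the small instruction message.
    resize_jobs.put({"photo_id": photo_id, "target": "thumbnail"})
    return photo_id

def resizer_worker():
    job = resize_jobs.get()
    # In a replicated, non-linearizable storage service this fetch could
    # race with replication and observe an old version, or nothing at all.
    return file_storage.get(job["photo_id"])

pid = handle_upload(b"...jpeg bytes...")
assert resizer_worker() == b"...jpeg bytes..."
```

In a single process the fetch always succeeds; the race described next arises only once the storage service is replicated and the queue can outrun its internal replication.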

图 9-5。Web 服务器和图像缩放器同时通过文件存储和消息队列进行通信,从而可能出现竞争条件。

Figure 9-5. The web server and image resizer communicate both through file storage and a message queue, opening the potential for race conditions.

如果文件存储服务是线性化的,那么这个系统应该可以正常工作。如果它不可线性化,则存在竞争条件的风险:消息队列( 图 9-5中的步骤 3 和 4 )可能比存储服务内的内部复制更快。在这种情况下,当缩放器获取图像(步骤 5)时,它可能会看到旧版本的图像,或者根本看不到任何图像。如果它处理旧版本的图像,则文件存储中的全尺寸图像和调整大小的图像将永久不一致。

If the file storage service is linearizable, then this system should work fine. If it is not linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in Figure 9-5) might be faster than the internal replication inside the storage service. In this case, when the resizer fetches the image (step 5), it might see an old version of the image, or nothing at all. If it processes an old version of the image, the full-size and resized images in the file storage become permanently inconsistent.

出现此问题的原因是 Web 服务器和缩放器之间存在两种不同的通信通道:文件存储和消息队列。如果没有线性化的近期保证,这两个通道之间的竞争条件是可能的。这种情况类似于 图 9-1,其中两个通信通道之间也存在竞争条件:数据库复制以及 Alice 的嘴和 Bob 的耳朵之间的现实音频通道。

This problem arises because there are two different communication channels between the web server and the resizer: the file storage and the message queue. Without the recency guarantee of linearizability, race conditions between these two channels are possible. This situation is analogous to Figure 9-1, where there was also a race condition between two communication channels: the database replication and the real-life audio channel between Alice’s mouth and Bob’s ears.

线性化并不是避免这种竞争条件的唯一方法,但它是最容易理解的方法。如果您控制额外的通信通道(例如消息队列的情况,但不是 Alice 和 Bob 的情况),您可以使用类似于我们在“读取您自己的写入”中讨论的替代方法,但代价是额外的复杂性。

Linearizability is not the only way of avoiding this race condition, but it’s the simplest to understand. If you control the additional communication channel (like in the case of the message queue, but not in the case of Alice and Bob), you can use alternative approaches similar to what we discussed in “Reading Your Own Writes”, at the cost of additional complexity.

实现线性化系统

Implementing Linearizable Systems

现在我们已经了解了一些线性化有用的示例,让我们考虑一下如何实现一个提供线性化语义的系统。

Now that we’ve looked at a few examples in which linearizability is useful, let’s think about how we might implement a system that offers linearizable semantics.

由于线性化本质上意味着“表现得好像只有一个数据副本,并且对其进行的所有操作都是原子的”,因此最简单的答案实际上是只使用数据的单个副本。然而,这种方法无法容忍错误:如果保存该副本的节点发生故障,数据将丢失,或者至少在该节点再次启动之前无法访问。

Since linearizability essentially means “behave as though there is only a single copy of the data, and all operations on it are atomic,” the simplest answer would be to really only use a single copy of the data. However, that approach would not be able to tolerate faults: if the node holding that one copy failed, the data would be lost, or at least inaccessible until the node was brought up again.

使系统具有容错能力的最常见方法是使用复制。让我们回顾一下第 5 章中的复制方法,并比较它们是否可以线性化:

The most common approach to making a system fault-tolerant is to use replication. Let’s revisit the replication methods from Chapter 5, and compare whether they can be made linearizable:

单领导者复制(可能可线性化)
Single-leader replication (potentially linearizable)

在具有单领导者复制的系统中(请参阅“领导者和追随者”),领导者拥有用于写入的数据的主副本,追随者在其他节点上维护数据的备份副本。如果您从领导者或同步更新的追随者那里进行读取,它们就有可能线性化。iv 然而,并不是每个单领导者数据库实际上都是可线性化的,无论是通过设计(例如,因为它使用快照隔离)还是由于并发错误[ 10 ]。

使用领导者进行读取依赖于您确切知道领导者是谁的假设。正如“真理是由多数人定义的”中所讨论的,节点很可能认为自己是领导者,而事实上它不是——并且如果妄想的领导者继续服务请求,它很可能会违反线性化[ 20 ]。对于异步复制,故障转移甚至可能会丢失已提交的写入(请参阅“处理节点中断”),这违反了持久性和线性化。

In a system with single-leader replication (see “Leaders and Followers”), the leader has the primary copy of the data that is used for writes, and the followers maintain backup copies of the data on other nodes. If you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable.iv However, not every single-leader database is actually linearizable, either by design (e.g., because it uses snapshot isolation) or due to concurrency bugs [10].

Using the leader for reads relies on the assumption that you know for sure who the leader is. As discussed in “The Truth Is Defined by the Majority”, it is quite possible for a node to think that it is the leader, when in fact it is not—and if the delusional leader continues to serve requests, it is likely to violate linearizability [20]. With asynchronous replication, failover may even lose committed writes (see “Handling Node Outages”), which violates both durability and linearizability.

共识算法(可线性化)
Consensus algorithms (linearizable)

我们将在本章后面讨论的一些共识算法与单领导者复制相似。然而,共识协议包含防止脑裂和过时副本的措施。得益于这些细节,共识算法可以安全地实现线性化存储。例如,这就是 ZooKeeper [ 21 ] 和 etcd [ 22 ] 的工作原理。

Some consensus algorithms, which we will discuss later in this chapter, bear a resemblance to single-leader replication. However, consensus protocols contain measures to prevent split brain and stale replicas. Thanks to these details, consensus algorithms can implement linearizable storage safely. This is how ZooKeeper [21] and etcd [22] work, for example.

多领导者复制(不可线性化)
Multi-leader replication (not linearizable)

具有多领导者复制的系统通常不可线性化,因为它们同时处理多个节点上的写入并将其异步复制到其他节点。因此,它们可能会产生需要解决的冲突写入(请参阅 “处理写入冲突”)。此类冲突是由于缺乏单个数据副本而造成的。

Systems with multi-leader replication are generally not linearizable, because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes. For this reason, they can produce conflicting writes that require resolution (see “Handling Write Conflicts”). Such conflicts are an artifact of the lack of a single copy of the data.

无领导者复制(可能不可线性化)
Leaderless replication (probably not linearizable)

对于无领导者复制的系统(Dynamo 风格;请参阅“无领导者复制”),人们有时声称可以通过要求仲裁读取和写入 ( w  +  r > n ) 来获得“强一致性”。根据仲裁的具体配置以及您定义强一致性的方式,这并不完全正确。

基于日历时钟的“最后写入获胜”冲突解决方法(例如 Cassandra 中的做法;请参阅“依赖同步时钟”)几乎肯定是非线性化的,因为由于时钟偏差,无法保证时钟时间戳与实际的事件顺序一致。宽松的法定人数(“宽松的法定人数与提示移交”)也会破坏任何实现线性化的机会。即使采用严格的法定人数,非线性化的行为也是可能出现的,如下一节所示。

For systems with leaderless replication (Dynamo-style; see “Leaderless Replication”), people sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes (w + r > n). Depending on the exact configuration of the quorums, and depending on how you define strong consistency, this is not quite true.

“Last write wins” conflict resolution methods based on time-of-day clocks (e.g., in Cassandra; see “Relying on Synchronized Clocks”) are almost certainly nonlinearizable, because clock timestamps cannot be guaranteed to be consistent with actual event ordering due to clock skew. Sloppy quorums (“Sloppy Quorums and Hinted Handoff”) also ruin any chance of linearizability. Even with strict quorums, nonlinearizable behavior is possible, as demonstrated in the next section.
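
A minimal illustration of why wall-clock LWW is unsafe: if one node's clock runs fast, its *earlier* write can carry the *later* timestamp and win. The merge function below is a deliberate simplification for illustration, not Cassandra's actual implementation:

```python
def lww_merge(writes):
    """Last write wins: keep the value with the highest timestamp.

    Each write is a (wall_clock_timestamp, value) pair.
    """
    return max(writes, key=lambda w: w[0])[1]

# Node A's clock runs 10 seconds fast. Its write really happened FIRST,
# but it carries the later wall-clock timestamp, so it wins the merge
# and the genuinely newer write from node B is silently discarded.
writes = [(1000 + 10, "older value from A"),
          (1005, "newer value from B")]
assert lww_merge(writes) == "older value from A"
```

With perfectly synchronized clocks the timestamp order would match the real order; clock skew is exactly what breaks that assumption.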

线性化和法定人数

Linearizability and quorums

直观上,严格的仲裁读取和写入似乎应该在 Dynamo 风格的模型中线性化。然而,当我们有可变的网络延迟时,可能会出现竞争条件,如图9-6所示。

Intuitively, it seems as though strict quorum reads and writes should be linearizable in a Dynamo-style model. However, when we have variable network delays, it is possible to have race conditions, as demonstrated in Figure 9-6.

图 9-6。尽管使用了严格的法定人数,执行仍然是非线性化的。

Figure 9-6. A nonlinearizable execution, despite using a strict quorum.

图 9-6中, x的初始值为0,写入器客户端 通过将写入发送到所有三个副本(n  = 3,w  = 3)来将x更新为 1。同时,客户端 A 从两个节点的法定数量 ( r  = 2) 中读取数据,并在其中一个节点上看到新值 1。同样在写入的同时,客户端 B 从两个节点的不同仲裁中读取数据,并从两个节点取回旧值 0。

In Figure 9-6, the initial value of x is 0, and a writer client is updating x to 1 by sending the write to all three replicas (n = 3, w = 3). Concurrently, client A reads from a quorum of two nodes (r = 2) and sees the new value 1 on one of the nodes. Also concurrently with the write, client B reads from a different quorum of two nodes, and gets back the old value 0 from both.

满足仲裁条件 ( w  +  r > n ),但此执行仍不可线性化:B 的请求在 A 的请求完成后开始,但 B 返回旧值,而 A 返回新值。(这又是 图 9-1中 Alice 和 Bob 的情况。)

The quorum condition is met (w + r > n), but this execution is nevertheless not linearizable: B’s request begins after A’s request completes, but B returns the old value while A returns the new value. (It’s once again the Alice and Bob situation from Figure 9-1.)
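
A toy simulation of the execution in Figure 9-6 makes the race concrete. Versioned registers and version-based quorum reads here are simplified stand-ins for a real Dynamo-style system:

```python
# Three replicas of register x, each holding a (version, value) pair.
replicas = [{"ver": 0, "val": 0} for _ in range(3)]

def write(i, ver, val):
    # A replica accepts a write only if it carries a newer version.
    if ver > replicas[i]["ver"]:
        replicas[i].update(ver=ver, val=val)

def quorum_read(ids):
    # r = 2: return the value with the highest version within the quorum.
    return max((replicas[i] for i in ids), key=lambda r: r["ver"])["val"]

# The writer sends x = 1 (version 1) to all three replicas (w = 3), but
# so far only replica 0 has received it; delivery to 1 and 2 is delayed.
write(0, 1, 1)
a = quorum_read([0, 1])  # client A reads {0, 1} and sees the new value
b = quorum_read([1, 2])  # client B starts AFTER A finished, sees the old
assert (a, b) == (1, 0)  # w + r > n held, yet the execution is stale
```

B's read began after A's completed, yet B observed the older value: the quorum condition alone does not give linearizability.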

有趣的是,可以以降低性能为代价使 Dynamo 风格的法定人数读写变得可线性化:读取者必须在将结果返回给应用程序之前同步执行读修复(请参阅“读修复与反熵”)[23],而写入者必须在发送写入之前先读取法定数量节点的最新状态 [24, 25]。然而,由于性能损失,Riak 不执行同步读修复 [26]。Cassandra 确实会在法定人数读取时等待读修复完成 [27],但由于它使用最后写入获胜的冲突解决方案,如果对同一个键有多个并发写入,它就会失去线性化。

Interestingly, it is possible to make Dynamo-style quorums linearizable at the cost of reduced performance: a reader must perform read repair (see “Read repair and anti-entropy”) synchronously, before returning results to the application [23], and a writer must read the latest state of a quorum of nodes before sending its writes [24, 25]. However, Riak does not perform synchronous read repair due to the performance penalty [26]. Cassandra does wait for read repair to complete on quorum reads [27], but it loses linearizability if there are multiple concurrent writes to the same key, due to its use of last-write-wins conflict resolution.
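
A sketch of the synchronous read-repair idea: before returning, the reader pushes the newest version it saw to the lagging members of its quorum, so a later reader whose quorum overlaps cannot go back in time. This is only an illustration of the approach, not any system's actual code:

```python
# Replica 0 already has the new write (version 1); replicas 1 and 2 lag.
replicas = [{"ver": 1, "val": 1}, {"ver": 0, "val": 0}, {"ver": 0, "val": 0}]

def quorum_read_with_repair(ids):
    newest = dict(max((replicas[i] for i in ids), key=lambda r: r["ver"]))
    # Synchronous read repair: update lagging quorum members
    # BEFORE returning the result to the application.
    for i in ids:
        if replicas[i]["ver"] < newest["ver"]:
            replicas[i] = dict(newest)
    return newest["val"]

assert quorum_read_with_repair([0, 1]) == 1  # reader A repairs replica 1...
assert quorum_read_with_repair([1, 2]) == 1  # ...so a later reader B sees 1
```

The cost is an extra round of synchronous writes on the read path, which is exactly the performance penalty the text mentions.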

此外,以这种方式只能实现可线性化的读写操作;无法实现可线性化的比较并设置(compare-and-set)操作,因为后者需要共识算法 [28]。

Moreover, only linearizable read and write operations can be implemented in this way; a linearizable compare-and-set operation cannot, because it requires a consensus algorithm [28].

总之,最安全的假设是具有 Dynamo 式复制的无领导系统不提供线性化。

In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability.

线性化的成本

The Cost of Linearizability

由于某些复制方法可以提供线性化,而其他复制方法则不能,因此更深入地探讨线性化的优缺点是很有趣的。

As some replication methods can provide linearizability and others cannot, it is interesting to explore the pros and cons of linearizability in more depth.

我们已经在第 5 章 中讨论了不同复制方法的一些用例;例如,我们看到多领导者复制通常是多数据中心复制的不错选择(请参阅“多数据中心操作”)。图 9-7展示了此类部署的示例 。

We already discussed some use cases for different replication methods in Chapter 5; for example, we saw that multi-leader replication is often a good choice for multi-datacenter replication (see “Multi-datacenter operation”). An example of such a deployment is illustrated in Figure 9-7.

图 9-7。网络中断迫使我们在线性化和可用性之间做出选择。

Figure 9-7. A network interruption forcing a choice between linearizability and availability.

考虑如果两个数据中心之间出现网络中断会发生什么情况。假设每个数据中心内的网络都正常工作,并且客户端可以访问数据中心,但数据中心之间无法相互连接。

Consider what happens if there is a network interruption between the two datacenters. Let’s assume that the network within each datacenter is working, and clients can reach the datacenters, but the datacenters cannot connect to each other.

使用多主数据库,每个数据中心都可以继续正常运行:由于一个数据中心的写入会异步复制到另一个数据中心,因此当网络连接恢复时,写入会简单地排队并交换。

With a multi-leader database, each datacenter can continue operating normally: since writes from one datacenter are asynchronously replicated to the other, the writes are simply queued up and exchanged when network connectivity is restored.

另一方面,如果使用单领导者复制,则领导者必须位于其中一个数据中心。任何写入和任何线性化读取都必须发送到领导者 - 因此,对于连接到跟随者数据中心的任何客户端,这些读取和写入请求必须通过网络同步发送到领导者数据中心。

On the other hand, if single-leader replication is used, then the leader must be in one of the datacenters. Any writes and any linearizable reads must be sent to the leader—thus, for any clients connected to a follower datacenter, those read and write requests must be sent synchronously over the network to the leader datacenter.

如果在单领导者设置中数据中心之间的网络中断,连接到追随者数据中心的客户端将无法联系领导者,因此它们既无法对数据库执行任何写入,也无法执行任何可线性化的读取。它们仍然可以从追随者读取数据,但读到的内容可能是陈旧的(非线性化)。如果应用程序需要可线性化的读写,那么网络中断会导致应用程序在无法联系领导者的数据中心中变得不可用。

If the network between datacenters is interrupted in a single-leader setup, clients connected to follower datacenters cannot contact the leader, so they cannot make any writes to the database, nor any linearizable reads. They can still make reads from the follower, but they might be stale (nonlinearizable). If the application requires linearizable reads and writes, the network interruption causes the application to become unavailable in the datacenters that cannot contact the leader.

如果客户端可以直接连接到领导者数据中心,这不是问题,因为应用程序可以在那里继续正常工作。但只能到达从属数据中心的客户端将遇到中断,直到网络链路修复为止。

If clients can connect directly to the leader datacenter, this is not a problem, since the application continues to work normally there. But clients that can only reach a follower datacenter will experience an outage until the network link is repaired.

CAP定理

The CAP theorem

这个问题不仅仅是单领导者和多领导者复制的结果:任何线性化数据库都存在这个问题,无论它是如何实现的。该问题也不是多数据中心部署所特有的,而是可能发生在任何不可靠的网络上,甚至在一个数据中心内也是如此。权衡如下:v

This issue is not just a consequence of single-leader and multi-leader replication: any linearizable database has this problem, no matter how it is implemented. The issue also isn’t specific to multi-datacenter deployments, but can occur on any unreliable network, even within one datacenter. The trade-off is as follows:v

  • 如果您的应用程序需要线性化,并且某些副本由于网络问题而与其他副本断开连接,那么这些副本在断开连接期间就无法处理请求:它们必须等待网络问题修复,或者返回错误(无论哪种方式,它们都变得不可用)。

  • If your application requires linearizability, and some replicas are disconnected from the other replicas due to a network problem, then some replicas cannot process requests while they are disconnected: they must either wait until the network problem is fixed, or return an error (either way, they become unavailable).

  • 如果您的应用程序不需要线性化,那么它可以以每个副本都可以独立处理请求的方式编写,即使它与其他副本(例如,多领导者)断开连接。在这种情况下,应用程序在遇到网络问题时可以保持可用,但其行为不可线性化。

  • If your application does not require linearizability, then it can be written in a way that each replica can process requests independently, even if it is disconnected from other replicas (e.g., multi-leader). In this case, the application can remain available in the face of a network problem, but its behavior is not linearizable.

因此,不需要线性化的应用程序可以更好地容忍网络问题。这种见解通常被称为 CAP 定理 [29, 30, 31, 32],由 Eric Brewer 在 2000 年命名,尽管分布式数据库的设计者自 20 世纪 70 年代以来就已经知道这种权衡 [33, 34, 35, 36]。

Thus, applications that don’t require linearizability can be more tolerant of network problems. This insight is popularly known as the CAP theorem [29, 30, 31, 32], named by Eric Brewer in 2000, although the trade-off has been known to designers of distributed databases since the 1970s [33, 34, 35, 36].

CAP 最初是作为经验法则提出的,没有精确的定义,目的是引发有关数据库权衡的讨论。当时,许多分布式数据库专注于在具有共享存储的机器集群上提供线性化语义[ 18 ],CAP鼓励数据库工程师探索分布式无共享系统的更广阔的设计空间,这更适合实现大型扩展网络服务[ 37 ]。CAP 因这种文化转变而值得赞扬——见证自 2000 年代中期以来新数据库技术(称为 NoSQL)的爆炸式增长。

CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of starting a discussion about trade-offs in databases. At the time, many distributed databases focused on providing linearizable semantics on a cluster of machines with shared storage [18], and CAP encouraged database engineers to explore a wider design space of distributed shared-nothing systems, which were more suitable for implementing large-scale web services [37]. CAP deserves credit for this culture shift—witness the explosion of new database technologies since the mid-2000s (known as NoSQL).

正式定义的CAP定理[ 30 ]的范围非常狭窄:它只考虑一种一致性模型(即线性化)和一种故障(网络分区vi或存活但彼此断开的节点)。它没有提及任何有关网络延迟、死节点或其他权衡的内容。因此,尽管 CAP 在历史上具有影响力,但它对于设计系统几乎没有实际价值 [ 9 , 40 ]。

The CAP theorem as formally defined [30] is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault (network partitions,vi or nodes that are alive but disconnected from each other). It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP has been historically influential, it has little practical value for designing systems [9, 40].

分布式系统中有许多更有趣的不可能性结果 [ 41 ],而 CAP 现在已被更精确的结果所取代 [ 2 , 42 ],因此它在今天主要具有历史意义。

There are many more interesting impossibility results in distributed systems [41], and CAP has now been superseded by more precise results [2, 42], so it is of mostly historical interest today.

线性化与网络延迟

Linearizability and network delays

尽管线性化是一个有用的保证,但令人惊讶的是,实践中真正可线性化的系统很少。例如,即使是现代多核 CPU 上的 RAM 也不是可线性化的 [43]:如果在一个 CPU 核心上运行的线程写入某个内存地址,另一个 CPU 核心上的线程随后不久读取同一地址,并不能保证读到第一个线程所写入的值(除非使用了内存屏障或栅栏 [44])。

Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable in practice. For example, even RAM on a modern multi-core CPU is not linearizable [43]: if a thread running on one CPU core writes to a memory address, and a thread on another CPU core reads the same address shortly afterward, it is not guaranteed to read the value written by the first thread (unless a memory barrier or fence [44] is used).

出现此行为的原因是每个 CPU 核心都有自己的内存高速缓存和存储缓冲区。默认情况下,内存访问首先进入缓存,任何更改都会异步写入主内存。由于访问缓存中的数据比访问主内存 [ 45 ]快得多,因此此功能对于现代 CPU 的良好性能至关重要。然而,现在数据有多个副本(一个在主存中,也许还有多个在各个缓存中),并且这些副本是异步更新的,因此线性化能力丢失了。

The reason for this behavior is that every CPU core has its own memory cache and store buffer. Memory access first goes to the cache by default, and any changes are asynchronously written out to main memory. Since accessing data in the cache is much faster than going to main memory [45], this feature is essential for good performance on modern CPUs. However, there are now several copies of the data (one in main memory, and perhaps several more in various caches), and these copies are asynchronously updated, so linearizability is lost.

为什么要做这种权衡?用 CAP 定理来解释多核内存一致性模型是没有意义的:在一台计算机内部,我们通常假设通信是可靠的,我们也不期望某个 CPU 核心在与计算机的其余部分断开连接后还能继续正常运行。放弃线性化的原因是性能,而不是容错。

Why make this trade-off? It makes no sense to use the CAP theorem to justify the multi-core memory consistency model: within one computer we usually assume reliable communication, and we don’t expect one CPU core to be able to continue operating normally if it is disconnected from the rest of the computer. The reason for dropping linearizability is performance, not fault tolerance.

许多选择不提供线性化保证的分布式数据库也是如此:它们这样做主要是为了提高性能,而不是为了容错性[ 46 ]。线性化速度很慢——这一直都是如此,不仅仅是在网络故障期间。

The same is true of many distributed databases that choose not to provide linearizable guarantees: they do so primarily to increase performance, not so much for fault tolerance [46]. Linearizability is slow—and this is true all the time, not only during a network fault.

我们难道不能找到一种更有效的线性化存储实现吗?看来答案是否定的:Attiya 和 Welch [ 47 ] 证明,如果你想要线性化,读写请求的响应时间至少与网络延迟的不确定性成正比。在具有高度可变延迟的网络中,如大多数计算机网络(请参阅“超时和无界延迟”),线性化读取和写入的响应时间不可避免地会很高。不存在更快的线性化算法,但较弱的一致性模型可以更快,因此这种权衡对于延迟敏感的系统很重要。在第 12 章中,我们将讨论一些在不牺牲正确性的情况下避免线性化的方法。

Can’t we maybe find a more efficient implementation of linearizable storage? It seems the answer is no: Attiya and Welch [47] prove that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. In a network with highly variable delays, like most computer networks (see “Timeouts and Unbounded Delays”), the response time of linearizable reads and writes is inevitably going to be high. A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster, so this trade-off is important for latency-sensitive systems. In Chapter 12 we will discuss some approaches for avoiding linearizability without sacrificing correctness.

顺序保证

Ordering Guarantees

我们之前说过,线性化寄存器的行为就好像只有一个数据副本,并且每个操作似乎在一个时间点以原子方式生效。这个定义意味着操作是按照某种明确定义的顺序执行的。我们通过按照操作执行的顺序将操作连接起来来说明图 9-4中的顺序。

We said previously that a linearizable register behaves as if there is only a single copy of the data, and that every operation appears to take effect atomically at one point in time. This definition implies that operations are executed in some well-defined order. We illustrated the ordering in Figure 9-4 by joining up the operations in the order in which they seem to have executed.

排序是本书中反复出现的主题,这表明它可能是一个重要的基本思想。让我们简要回顾一下我们讨论排序的其他一些上下文:

Ordering has been a recurring theme in this book, which suggests that it might be an important fundamental idea. Let’s briefly recap some of the other contexts in which we have discussed ordering:

  • 第 5 章中,我们看到单领导者复制中领导者的主要目的是确定复制日志中的写入顺序,即追随者应用这些写入的顺序。如果没有单个领导者,可能会因并发操作而发生冲突(请参阅“处理写入冲突”)。

  • In Chapter 5 we saw that the main purpose of the leader in single-leader replication is to determine the order of writes in the replication log—that is, the order in which followers apply those writes. If there is no single leader, conflicts can occur due to concurrent operations (see “Handling Write Conflicts”).

  • 我们在第 7 章中讨论的可串行性是为了确保事务的行为就像它们是按某种顺序执行的一样。它可以通过按顺序执行事务来实现,或者通过允许并发执行同时防止序列化冲突(通过锁定或中止)来实现。

  • Serializability, which we discussed in Chapter 7, is about ensuring that transactions behave as if they were executed in some sequential order. It can be achieved by literally executing transactions in that serial order, or by allowing concurrent execution while preventing serialization conflicts (by locking or aborting).

  • 我们在第 8 章中讨论的分布式系统中时间戳和时钟的使用 (请参阅“依赖同步时钟”)是另一种将秩序引入无序世界的尝试,例如确定两个写入中的哪一个稍后发生。

  • The use of timestamps and clocks in distributed systems that we discussed in Chapter 8 (see “Relying on Synchronized Clocks”) is another attempt to introduce order into a disorderly world, for example to determine which one of two writes happened later.

事实证明,排序、线性化和共识之间存在着深刻的联系。尽管这个概念比本书的其余部分更加理论和抽象,但它对于澄清我们对系统可以做什么和不能做什么的理解非常有帮助。我们将在接下来的几节中探讨这个主题。

It turns out that there are deep connections between ordering, linearizability, and consensus. Although this notion is a bit more theoretical and abstract than the rest of this book, it is very helpful for clarifying our understanding of what systems can and cannot do. We will explore this topic in the next few sections.

顺序和因果关系

Ordering and Causality

排序不断出现的原因有几个,其中之一是它有助于保持因果关系。在本书中,我们已经看到了几个因果关系很重要的例子:

There are several reasons why ordering keeps coming up, and one of the reasons is that it helps preserve causality. We have already seen several examples over the course of this book where causality has been important:

  • 在“一致前缀读”(图 5-5)中,我们看到一个例子:对话的观察者先看到问题的答案,然后才看到被回答的问题。这令人困惑,因为它违反了我们对因果关系的直觉:如果一个问题被回答了,那么显然这个问题必须先出现,因为给出答案的人一定已经看到了这个问题(假设他们没有通灵能力,无法预见未来)。我们说问题和答案之间存在因果依赖。

  • In “Consistent Prefix Reads” (Figure 5-5) we saw an example where the observer of a conversation saw first the answer to a question, and then the question being answered. This is confusing because it violates our intuition of cause and effect: if a question is answered, then clearly the question had to be there first, because the person giving the answer must have seen the question (assuming they are not psychic and cannot see into the future). We say that there is a causal dependency between the question and the answer.

  • 图 5-9中出现了类似的模式,我们查看了三个领导者之间的复制,并注意到由于网络延迟,某些写入可能“超过”其他写入。从其中一个副本的角度来看,似乎对不存在的行进行了更新。这里的因果关系意味着必须先创建一行,然后才能更新它。

  • A similar pattern appeared in Figure 5-9, where we looked at the replication between three leaders and noticed that some writes could “overtake” others due to network delays. From the perspective of one of the replicas it would look as though there was an update to a row that did not exist. Causality here means that a row must first be created before it can be updated.

  • “检测并发写入”中,我们观察到,如果有两个操作 A 和 B,则存在三种可能性:A 发生在 B 之前,或者 B 发生在 A 之前,或者 A 和 B 是并发的。这种发生在关系之前的情况是因果关系的另一种表达:如果A发生在B之前,则意味着B可能知道A,或者建立在A之上,或者依赖于A。如果A和B同时发生,则它们之间不存在因果关系;换句话说,我们确信双方都不知道对方。

  • In “Detecting Concurrent Writes” we observed that if you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. This happened before relationship is another expression of causality: if A happened before B, that means B might have known about A, or built upon A, or depended on A. If A and B are concurrent, there is no causal link between them; in other words, we are sure that neither knew about the other.

  • 在事务的快照隔离( “快照隔离和可重复读取” ) 的上下文中,我们说事务从一致的快照中读取。但在这种情况下“一致”意味着什么?这意味着与因果关系一致:如果快照包含答案,它也必须包含正在回答的问题[ 48 ]。在单个时间点观察整个数据库使其符合因果关系:在该时间点之前因果发生的所有操作的影响是可见的,但之后因果发生的操作却看不到。 读倾斜(不可重复读,如图7-6所示)是指在违反因果关系的状态下读取数据。

  • In the context of snapshot isolation for transactions (“Snapshot Isolation and Repeatable Read”), we said that a transaction reads from a consistent snapshot. But what does “consistent” mean in this context? It means consistent with causality: if the snapshot contains an answer, it must also contain the question being answered [48]. Observing the entire database at a single point in time makes it consistent with causality: the effects of all operations that happened causally before that point in time are visible, but no operations that happened causally afterward can be seen. Read skew (non-repeatable reads, as illustrated in Figure 7-6) means reading data in a state that violates causality.

  • 我们的事务之间写入偏差的示例(请参阅“写入偏差和幻像”)也证明了因果依赖性:在图 7-8中,Alice 被允许下班,因为事务认为 Bob 仍在待命,反之亦然。在这种情况下,下班的行为因果上取决于对当前正在通话的人的观察。可串行快照隔离(请参阅“可串行快照隔离 (SSI)”)通过跟踪事务之间的因果依赖性来检测写入偏差。

  • Our examples of write skew between transactions (see “Write Skew and Phantoms”) also demonstrated causal dependencies: in Figure 7-8, Alice was allowed to go off call because the transaction thought that Bob was still on call, and vice versa. In this case, the action of going off call is causally dependent on the observation of who is currently on call. Serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)”) detects write skew by tracking the causal dependencies between transactions.

  • 在 Alice 和 Bob 看足球的例子中(图 9-1),Bob 在听到 Alice 感叹结果后从服务器得到了一个陈旧的结果,这一事实违反了因果关系:Alice 的感叹与比分的公布有因果关系,所以鲍勃在听到爱丽丝的声音后也应该能够看到分数。同样的模式以图像调整大小服务的形式再次出现在“跨通道时序依赖性”中。

  • In the example of Alice and Bob watching football (Figure 9-1), the fact that Bob got a stale result from the server after hearing Alice exclaim the result is a causality violation: Alice’s exclamation is causally dependent on the announcement of the score, so Bob should also be able to see the score after hearing Alice. The same pattern appeared again in “Cross-channel timing dependencies” in the guise of an image resizing service.

因果关系对事件施加了顺序:原因先于结果;消息在接收之前发送;问题先于答案。而且,就像在现实生活中一样,一件事会导致另一件事:一个节点读取一些数据,然后写入一些结果,另一个节点读取已写入的数据,然后依次写入其他数据,依此类推。这些因果相关的操作链定义了系统中的因果顺序,即先发生什么。

Causality imposes an ordering on events: cause comes before effect; a message is sent before that message is received; the question comes before the answer. And, like in real life, one thing leads to another: one node reads some data and then writes something as a result, another node reads the thing that was written and writes something else in turn, and so on. These chains of causally dependent operations define the causal order in the system—i.e., what happened before what.

如果一个系统遵循因果关系所施加的顺序,我们就说它是因果一致的。例如,快照隔离提供了因果一致性:当您从数据库读取并看到某条数据时,您必须也能看到在因果上先于它的任何数据(假设后者在此期间没有被删除)。

If a system obeys the ordering imposed by causality, we say that it is causally consistent. For example, snapshot isolation provides causal consistency: when you read from the database, and you see some piece of data, then you must also be able to see any data that causally precedes it (assuming it has not been deleted in the meantime).

因果顺序不是全序

The causal order is not a total order

全序允许比较任意两个元素,因此如果有两个元素,您总是可以说出哪个更大、哪个更小。例如,自然数是全序的:如果我给您任意两个数字,比如 5 和 13,您可以告诉我 13 大于 5。

A total order allows any two elements to be compared, so if you have two elements, you can always say which one is greater and which one is smaller. For example, natural numbers are totally ordered: if I give you any two numbers, say 5 and 13, you can tell me that 13 is greater than 5.

然而,数学中的集合并不是全序的:{a, b} 大于 {b, c} 吗?嗯,您无法真正比较它们,因为两者都不是对方的子集。我们说它们是不可比较的,因此数学集合是偏序的:在某些情况下一个集合大于另一个集合(如果一个集合包含另一个集合的所有元素),但在其他情况下它们是不可比较的。

However, mathematical sets are not totally ordered: is {a, b} greater than {b, c}? Well, you can’t really compare them, because neither is a subset of the other. We say they are incomparable, and therefore mathematical sets are partially ordered: in some cases one set is greater than another (if one set contains all the elements of another), but in other cases they are incomparable.
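
Incidentally, Python's built-in set comparison operators implement exactly this subset partial order, which makes the example easy to check directly:

```python
a, b = {"a", "b"}, {"b", "c"}

# Neither set contains the other, so they are incomparable under the
# subset order: both "less than" checks fail, yet the sets differ.
assert not a < b and not b < a and a != b

# A comparable pair: {"b"} is a proper subset of {"b", "c"}.
assert {"b"} < b
```

For numbers, `x < y` and `y < x` cannot both be false unless `x == y` (a total order); for sets, both can be false while the sets still differ (a partial order).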

全序和偏序的区别体现在不同的数据库一致性模型上:

The difference between a total order and a partial order is reflected in different database consistency models:

线性化
Linearizability

在线性化系统中,我们有一个总的操作顺序:如果系统的行为就像只有一个数据副本,并且每个操作都是原子的,这意味着对于任何两个操作,我们总是可以说哪一个先发生。这种总排序如图 9-4中的时间线所示。

In a linearizable system, we have a total order of operations: if the system behaves as if there is only a single copy of the data, and every operation is atomic, this means that for any two operations we can always say which one happened first. This total ordering is illustrated as a timeline in Figure 9-4.

因果关系
Causality

我们说过,如果两个操作都没有发生在另一个之前,那么这两个操作就是并发的(请参阅“‘happens-before’关系与并发”)。换句话说,如果两个事件存在因果关系(一个发生在另一个之前),那么它们是有序的;但如果它们是并发的,那么它们就是不可比较的。这意味着因果关系定义的是偏序而不是全序:某些操作相互之间是有序的,但有些操作是不可比较的。

We said that two operations are concurrent if neither happened before the other (see “The “happens-before” relationship and concurrency”). Put another way, two events are ordered if they are causally related (one happened before the other), but they are incomparable if they are concurrent. This means that causality defines a partial order, not a total order: some operations are ordered with respect to each other, but some are incomparable.
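
One standard way to represent this partial order in code is with version vectors (as in "Detecting Concurrent Writes"): comparing two vectors element-wise yields "before", "after", "equal", or "concurrent". A minimal sketch:

```python
def compare(v1, v2):
    """Compare two version vectors: before / after / equal / concurrent."""
    le = all(x <= y for x, y in zip(v1, v2))
    ge = all(x >= y for x, y in zip(v1, v2))
    if le and ge:
        return "equal"
    if le:
        return "before"      # v1 happened before v2
    if ge:
        return "after"       # v2 happened before v1
    return "concurrent"      # neither dominates: incomparable

assert compare([1, 0, 0], [2, 1, 0]) == "before"      # causally ordered
assert compare([2, 0, 0], [0, 0, 1]) == "concurrent"  # incomparable
```

The "concurrent" outcome is precisely the case a total order cannot express, and it is where conflict resolution becomes necessary.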

因此,根据此定义,可线性化的数据存储中不存在并发操作:必须存在一条单一的时间线,所有操作都沿着它被全序排列。可能有多个请求在等待处理,但数据存储保证每个请求都在单个时间点、作用于单一数据副本、沿着单一时间线被原子地处理,没有任何并发。

Therefore, according to this definition, there are no concurrent operations in a linearizable datastore: there must be a single timeline along which all operations are totally ordered. There might be several requests waiting to be handled, but the datastore ensures that every request is handled atomically at a single point in time, acting on a single copy of the data, along a single timeline, without any concurrency.

并发意味着时间线再次分支和合并,在这种情况下,不同分支上的操作是不可比较的(即并发)。我们在第 5 章中看到了这种现象 :例如,图 5-14并不是一个直线全序,而是同时进行的不同操作的混乱。图中的箭头表示因果依赖性——操作的部分顺序。

Concurrency would mean that the timeline branches and merges again—and in this case, operations on different branches are incomparable (i.e., concurrent). We saw this phenomenon in Chapter 5: for example, Figure 5-14 is not a straight-line total order, but rather a jumble of different operations going on concurrently. The arrows in the diagram indicate causal dependencies—the partial ordering of operations.

如果您熟悉 Git 等分布式版本控制系统,它们的版本历史非常类似于因果依赖关系图。通常,一个提交接连发生,呈一条直线,但有时您会得到分支(当几个人同时处理一个项目时),并且当合并这些同时创建的提交时,会创建合并。

If you are familiar with distributed version control systems such as Git, their version histories are very much like the graph of causal dependencies. Often one commit happens after another, in a straight line, but sometimes you get branches (when several people concurrently work on a project), and merges are created when those concurrently created commits are combined.

线性化能力强于因果一致性

Linearizability is stronger than causal consistency

那么因果顺序和线性化之间有什么关系呢?答案是,线性化意味着因果关系:任何可线性化的系统都将正确保留因果关系[ 7 ]。特别是,如果系统中有多个通信通道(如图9-5中的消息队列和文件存储服务),线性化可确保自动保留因果关系,而无需系统执行任何特殊操作(例如传递消息)不同组件之间的时间戳)。

So what is the relationship between the causal order and linearizability? The answer is that linearizability implies causality: any system that is linearizable will preserve causality correctly [7]. In particular, if there are multiple communication channels in a system (such as the message queue and the file storage service in Figure 9-5), linearizability ensures that causality is automatically preserved without the system having to do anything special (such as passing around timestamps between different components).

线性化确保了因果关系,这一事实使得线性化系统易于理解且有吸引力。然而,正如“线性化的成本”中所讨论的,使系统线性化可能会损害其性能和可用性,特别是如果系统具有显着的网络延迟(例如,如果它是地理分布式的)。因此,一些分布式数据系统放弃了线性化,这使它们能够实现更好的性能,但可能使它们难以使用。

The fact that linearizability ensures causality is what makes linearizable systems simple to understand and appealing. However, as discussed in “The Cost of Linearizability”, making a system linearizable can harm its performance and availability, especially if the system has significant network delays (for example, if it’s geographically distributed). For this reason, some distributed data systems have abandoned linearizability, which allows them to achieve better performance but can make them difficult to work with.

好消息是,中间立场是可能的。线性化并不是保留因果关系的唯一方法,还有其他方法。系统可以做到因果一致,而无需承担使其线性化所带来的性能损失(特别是,CAP 定理并不适用)。事实上,因果一致性是不会因网络延迟而变慢、并且在网络故障时仍然可用的最强一致性模型 [2, 42]。

The good news is that a middle ground is possible. Linearizability is not the only way of preserving causality—there are other ways too. A system can be causally consistent without incurring the performance hit of making it linearizable (in particular, the CAP theorem does not apply). In fact, causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures [2, 42].

在许多情况下,看似需要线性化的系统实际上只需要因果一致性,而因果一致性可以更高效地实现。基于这一观察,研究人员正在探索保留因果关系的新型数据库,其性能和可用性特征与最终一致性系统相似 [49, 50, 51]。

In many cases, systems that appear to require linearizability in fact only really require causal consistency, which can be implemented more efficiently. Based on this observation, researchers are exploring new kinds of databases that preserve causality, with performance and availability characteristics that are similar to those of eventually consistent systems [49, 50, 51].

由于这项研究是最近才进行的,因此尚未进入生产系统,并且仍然存在需要克服的挑战 [ 52 , 53 ]。然而,这对于未来的系统来说是一个有前途的方向。

As this research is quite recent, not much of it has yet made its way into production systems, and there are still challenges to be overcome [52, 53]. However, it is a promising direction for future systems.

捕获因果依赖性

Capturing causal dependencies

我们不会在这里详细讨论非线性系统如何保持因果一致性的所有细节,而只是简单地探讨一些关键思想。

We won’t go into all the nitty-gritty details of how nonlinearizable systems can maintain causal consistency here, but just briefly explore some of the key ideas.

为了保持因果关系,您需要知道哪个操作发生在哪个操作之前。这是部分顺序:并发操作可以按任何顺序处理,但如果一个操作发生在另一个操作之前,则必须在每个副本上按该顺序处理它们。因此,当副本处理一个操作时,它必须确保所有因果上的操作(之前发生的所有操作)都已被处理;如果前面的某个操作丢失,则后面的操作必须等待前面的操作处理完毕。

In order to maintain causality, you need to know which operation happened before which other operation. This is a partial order: concurrent operations may be processed in any order, but if one operation happened before another, then they must be processed in that order on every replica. Thus, when a replica processes an operation, it must ensure that all causally preceding operations (all operations that happened before) have already been processed; if some preceding operation is missing, the later operation must wait until the preceding operation has been processed.
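下面是一个极简的示意性草图(并非书中代码,Python 类与操作标识的形式均为本例的假设),展示副本如何缓冲一个操作,直到其所有因果前驱都已被处理:

A minimal illustrative sketch (not from the book; the Python class and the form of operation IDs are assumptions of this example) of a replica that buffers an operation until all of its causal predecessors have been processed:

```python
class Replica:
    def __init__(self):
        self.applied = set()   # IDs of operations already processed
        self.pending = []      # operations waiting on missing dependencies

    def receive(self, op_id, deps, payload):
        """deps: IDs of the operations that causally precede this one."""
        self.pending.append((op_id, frozenset(deps), payload))
        self._drain()

    def _drain(self):
        # Repeatedly apply any pending operation whose causal
        # predecessors have all been applied already.
        progress = True
        while progress:
            progress = False
            for op in list(self.pending):
                op_id, deps, payload = op
                if deps <= self.applied:
                    self.applied.add(op_id)
                    self.pending.remove(op)
                    progress = True

r = Replica()
r.receive("B", {"A"}, "second write")  # arrives first, but depends on A
assert "B" not in r.applied            # held back until A is processed
r.receive("A", set(), "first write")
assert {"A", "B"} <= r.applied         # both applied, in causal order
```

并发操作(互不依赖的操作)则可以按任意到达顺序立即应用。Concurrent operations (with no dependency between them) can be applied immediately in whatever order they arrive.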

为了确定因果依赖性,我们需要某种方式来描述系统中节点的“知识”。如果节点在发出写入 Y 时已经看到了值 X,则 X 和 Y 可能存在因果关系。这种分析使用的正是欺诈指控刑事调查中会出现的那类问题:首席执行官在做出决定 Y 时是否知道 X?

In order to determine causal dependencies, we need some way of describing the “knowledge” of a node in the system. If a node had already seen the value X when it issued the write Y, then X and Y may be causally related. The analysis uses the kinds of questions you would expect in a criminal investigation of fraud charges: did the CEO know about X at the time when they made decision Y?

确定哪个操作发生在哪个操作之前的技术与我们在“检测并发写入” 中讨论的技术类似。该部分讨论了无领导者数据存储中的因果关系,其中我们需要检测对同一键的并发写入,以防止丢失更新。因果一致性更进一步:它需要跟踪整个数据库的因果依赖关系,而不仅仅是单个键的依赖关系。版本向量可以推广到做到这一点[ 54 ]。

The techniques for determining which operation happened before which other operation are similar to what we discussed in “Detecting Concurrent Writes”. That section discussed causality in a leaderless datastore, where we need to detect concurrent writes to the same key in order to prevent lost updates. Causal consistency goes further: it needs to track causal dependencies across the entire database, not just for a single key. Version vectors can be generalized to do this [54].

为了确定因果顺序,数据库需要知道应用程序读取了哪个版本的数据。这就是为什么在图 5-13中,先前操作的版本号在写入时被传回数据库。类似的想法也出现在SSI的冲突检测中,正如《可串行化快照隔离(SSI)》中所讨论的:当事务要提交时,数据库会检查它读取的数据版本是否仍然是最新的。为此,数据库会跟踪哪个事务读取了哪些数据。

In order to determine the causal ordering, the database needs to know which version of the data was read by the application. This is why, in Figure 5-13, the version number from the prior operation is passed back to the database on a write. A similar idea appears in the conflict detection of SSI, as discussed in “Serializable Snapshot Isolation (SSI)”: when a transaction wants to commit, the database checks whether the version of the data that it read is still up to date. To this end, the database keeps track of which data has been read by which transaction.

序列号排序

Sequence Number Ordering

尽管因果关系是一个重要的理论概念,但实际跟踪所有因果依赖可能变得不切实际。在许多应用程序中,客户端在写入数据之前会读取大量数据,这时就不清楚该写入是因果依赖于全部还是仅部分先前的读取。显式跟踪所有已读取的数据将意味着巨大的开销。

Although causality is an important theoretical concept, actually keeping track of all causal dependencies can become impractical. In many applications, clients read lots of data before writing something, and then it is not clear whether the write is causally dependent on all or only some of those prior reads. Explicitly tracking all the data that has been read would mean a large overhead.

然而,有一种更好的方法:我们可以使用序列号或时间戳来对事件进行排序。时间戳不必来自日历时钟(即物理时钟,它们有很多问题,如“不可靠的时钟”中所讨论的)。它可以来自逻辑时钟,这是一种生成数字序列来标识操作的算法,通常使用对每个操作递增的计数器。

However, there is a better way: we can use sequence numbers or timestamps to order events. A timestamp need not come from a time-of-day clock (or physical clock, which have many problems, as discussed in “Unreliable Clocks”). It can instead come from a logical clock, which is an algorithm to generate a sequence of numbers to identify operations, typically using counters that are incremented for every operation.

此类序列号或时间戳是紧凑的(大小只有几个字节),并且它们提供了全序:也就是说,每个操作都有唯一的序列号,并且您始终可以比较两个序列号以确定哪个更大(即哪个操作发生在后)。

Such sequence numbers or timestamps are compact (only a few bytes in size), and they provide a total order: that is, every operation has a unique sequence number, and you can always compare two sequence numbers to determine which is greater (i.e., which operation happened later).

特别是,我们可以按照与因果关系一致的全序来创建序列号:vii 我们承诺,如果操作 A 在因果上发生在 B 之前,则 A 在全序中位于 B 之前(A 的序列号比 B 小)。并发操作可以任意排序。这样的全序捕获了所有因果信息,但也施加了比因果关系严格要求的更多的排序。

In particular, we can create sequence numbers in a total order that is consistent with causality:vii we promise that if operation A causally happened before B, then A occurs before B in the total order (A has a lower sequence number than B). Concurrent operations may be ordered arbitrarily. Such a total order captures all the causality information, but also imposes more ordering than strictly required by causality.

在具有单领导者复制的数据库中(请参阅“领导者和追随者”),复制日志定义了与因果关系一致的写操作的总顺序。领导者可以简单地为每个操作增加一个计数器,从而为复制日志中的每个操作分配一个单调递增的序列号。如果追随者按照复制日志中出现的顺序应用写入,则追随者的状态始终因果一致(即使它落后于领导者)。

In a database with single-leader replication (see “Leaders and Followers”), the replication log defines a total order of write operations that is consistent with causality. The leader can simply increment a counter for each operation, and thus assign a monotonically increasing sequence number to each operation in the replication log. If a follower applies the writes in the order they appear in the replication log, the state of the follower is always causally consistent (even if it is lagging behind the leader).

非因果序列号生成器

Noncausal sequence number generators

如果没有单个领导者(可能是因为您使用的是多领导者或无领导者数据库,或者因为数据库已分区),则不太清楚如何生成操作的序列号。实践中使用了多种方法:

If there is not a single leader (perhaps because you are using a multi-leader or leaderless database, or because the database is partitioned), it is less clear how to generate sequence numbers for operations. Various methods are used in practice:

  • 每个节点都可以生成自己独立的一组序列号。例如,如果有两个节点,一个节点只能生成奇数,另一个节点只能生成偶数。一般来说,您可以在序列号的二进制表示中保留一些位来包含唯一的节点标识符,这将确保两个不同的节点永远不会生成相同的序列号。

  • Each node can generate its own independent set of sequence numbers. For example, if you have two nodes, one node can generate only odd numbers and the other only even numbers. In general, you could reserve some bits in the binary representation of the sequence number to contain a unique node identifier, and this would ensure that two different nodes can never generate the same sequence number.

  • 您可以将来自日历时钟(物理时钟)的时间戳附加到每个操作 [55]。此类时间戳不是连续的,但如果它们具有足够高的分辨率,则可能足以对操作进行全排序。这一事实被用于最后写入获胜的冲突解决方法(请参阅“排序事件的时间戳”)。

  • You can attach a timestamp from a time-of-day clock (physical clock) to each operation [55]. Such timestamps are not sequential, but if they have sufficiently high resolution, they might be sufficient to totally order operations. This fact is used in the last write wins conflict resolution method (see “Timestamps for ordering events”).

  • 您可以预先分配序列号块。例如,节点 A 可能声明序列号从 1 到 1,000 的块,而节点 B 可能声明序列号从 1,001 到 2,000 的块。然后,每个节点可以独立地从其块中分配序列号,并在其序列号供应开始不足时分配新块。

  • You can preallocate blocks of sequence numbers. For example, node A might claim the block of sequence numbers from 1 to 1,000, and node B might claim the block from 1,001 to 2,000. Then each node can independently assign sequence numbers from its block, and allocate a new block when its supply of sequence numbers begins to run low.

与将所有操作推送给一个递增计数器的单一领导者相比,这三个选项的性能更好,也更具可扩展性。它们为每个操作生成一个唯一的、近似递增的序列号。然而,它们都有一个问题:它们生成的序列号与因果关系不一致。

These three options all perform better and are more scalable than pushing all operations through a single leader that increments a counter. They generate a unique, approximately increasing sequence number for each operation. However, they all have a problem: the sequence numbers they generate are not consistent with causality.
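上面第一种选项(在序列号的低位中保留节点 ID)可以用如下的玩具草图来说明(并非书中代码;位数等参数是本例的假设):

The first option above (reserving the low bits of each sequence number for a node ID) can be illustrated with the following toy sketch (not from the book; the bit width and other parameters are assumptions of this example):

```python
NODE_ID_BITS = 10          # assumption: supports up to 2**10 = 1024 nodes

def make_generator(node_id):
    """Return a sequence number generator unique to this node."""
    assert 0 <= node_id < (1 << NODE_ID_BITS)
    counter = 0
    def next_seq():
        nonlocal counter
        counter += 1
        # High bits: local counter; low bits: node ID.
        return (counter << NODE_ID_BITS) | node_id
    return next_seq

node_a = make_generator(1)
node_b = make_generator(2)
seqs_a = {node_a() for _ in range(1000)}
seqs_b = {node_b() for _ in range(1000)}
assert seqs_a.isdisjoint(seqs_b)   # two nodes never collide
```

请注意,这正是正文所指出的问题:如果节点 A 处理的操作多于节点 B,A 的序列号会超前于 B,因此这些数字的大小关系并不反映因果顺序。Note that this exhibits exactly the problem described in the text: if node A processes more operations than node B, A's numbers run ahead of B's, so comparing them says nothing about causal order.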

出现因果关系问题是因为这些序列号生成器无法正确捕获不同节点之间的操作顺序:

The causality problems occur because these sequence number generators do not correctly capture the ordering of operations across different nodes:

  • 每个节点每秒可以处理不同数量的操作。因此,如果一个节点生成偶数而另一节点生成奇数,则偶数的计数器可能落后于奇数的计数器,反之亦然。如果你有一个奇数操作和一个偶数操作,你就无法准确地判断哪个操作先发生。

  • Each node may process a different number of operations per second. Thus, if one node generates even numbers and the other generates odd numbers, the counter for even numbers may lag behind the counter for odd numbers, or vice versa. If you have an odd-numbered operation and an even-numbered operation, you cannot accurately tell which one causally happened first.

  • 物理时钟的时间戳会受到时钟偏差的影响,这可能会使它们与因果关系不一致。例如,参见图 8-3,它显示了一个场景,其中因果关系稍后发生的操作实际上被分配了较低的时间戳。

  • Timestamps from physical clocks are subject to clock skew, which can make them inconsistent with causality. For example, see Figure 8-3, which shows a scenario in which an operation that happened causally later was actually assigned a lower timestamp.viii

  • 在块分配器的情况下,一个操作可以被给予1,001到2,000范围内的序列号,并且因果上较晚的操作可以被给予1到1,000范围内的编号。这里,序列号再次与因果关系不一致。

  • In the case of the block allocator, one operation may be given a sequence number in the range from 1,001 to 2,000, and a causally later operation may be given a number in the range from 1 to 1,000. Here, again, the sequence number is inconsistent with causality.

Lamport 时间戳

Lamport timestamps

虽然刚才描述的三种序列号生成器与因果关系不一致,但实际上有一种简单的方法可以生成与因果关系一致的序列。它被称为Lamport 时间戳,由 Leslie Lamport [ 56 ] 于 1978 年提出,现在是分布式系统领域被引用最多的论文之一。

Although the three sequence number generators just described are inconsistent with causality, there is actually a simple method for generating sequence numbers that is consistent with causality. It is called a Lamport timestamp, proposed in 1978 by Leslie Lamport [56], in what is now one of the most-cited papers in the field of distributed systems.

Lamport 时间戳的使用如图 9-8 所示。每个节点都有一个唯一的标识符,并且每个节点都保存一个它已处理的操作数量的计数器。Lamport 时间戳就是一对(计数器, 节点 ID)。两个节点有时可能具有相同的计数器值,但通过在时间戳中包含节点 ID,每个时间戳都是唯一的。

The use of Lamport timestamps is illustrated in Figure 9-8. Each node has a unique identifier, and each node keeps a counter of the number of operations it has processed. The Lamport timestamp is then simply a pair of (counter, node ID). Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp, each timestamp is made unique.

图 9-8。Lamport 时间戳提供与因果关系一致的总排序。

Lamport 时间戳与物理的日历时钟没有任何关系,但它提供了全序:如果有两个时间戳,计数器值较大的那个是较大的时间戳;如果计数器值相同,则节点 ID 较大的那个是较大的时间戳。

A Lamport timestamp bears no relationship to a physical time-of-day clock, but it provides total ordering: if you have two timestamps, the one with a greater counter value is the greater timestamp; if the counter values are the same, the one with the greater node ID is the greater timestamp.

到目前为止,此描述本质上与上一节中描述的偶/奇计数器相同。Lamport 时间戳的关键思想(使它们与因果关系一致)如下:每个节点和每个客户端都跟踪迄今为止所看到的最大计数器值,并在每个请求中包含该最大值。当节点接收到最大计数器值大于其自身计数器值的请求或响应时,它立即将其自己的计数器增加到该最大值。

So far this description is essentially the same as the even/odd counters described in the last section. The key idea about Lamport timestamps, which makes them consistent with causality, is the following: every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request. When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.

如图 9-8 所示,客户端 A 从节点 2 接收到计数器值 5,然后将这个最大值 5 发送给节点 1。此时,节点 1 的计数器只有 1,但它立即被前移到 5,因此下一个操作的计数器值递增为 6。

This is shown in Figure 9-8, where client A receives a counter value of 5 from node 2, and then sends that maximum of 5 to node 1. At that time, node 1’s counter was only 1, but it was immediately moved forward to 5, so the next operation had an incremented counter value of 6.

只要每个操作都携带最大计数器值,该方案就可以确保 Lamport 时间戳的排序与因果关系一致,因为每个因果依赖性都会导致时间戳增加。

As long as the maximum counter value is carried along with every operation, this scheme ensures that the ordering from the Lamport timestamps is consistent with causality, because every causal dependency results in an increased timestamp.
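上述方案可以写成一个简短的草图(Python 类与方法名是本例的假设;时间戳比较直接利用元组的字典序,即先比计数器、再以节点 ID 决胜):

The scheme above can be written as a short sketch (the Python class and method names are assumptions of this example; timestamp comparison uses plain tuple ordering, i.e., counter first, node ID as tiebreaker):

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Local operation or message send: increment and stamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, timestamp):
        """On receiving a message, jump forward to the max counter seen."""
        counter, _node = timestamp
        self.counter = max(self.counter, counter)

node1 = LamportClock(1)
node2 = LamportClock(2)

t_send = node2.tick()    # node 2 performs an operation: (1, 2)
node1.receive(t_send)    # node 1 learns of node 2's counter
t_next = node1.tick()    # node 1's next operation: (2, 1)

# Tuple comparison implements the total order: greater counter wins,
# ties are broken by node ID.
assert t_next > t_send   # the causally later operation sorts later
```

若没有 receive 中的“跳到最大值”这一步,这个方案就退化为上一节中与因果关系不一致的独立计数器。Without the jump-to-maximum step in receive, this degenerates into the independent per-node counters of the previous section, which are not consistent with causality.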

Lamport 时间戳有时会与版本向量混淆,我们在 “检测并发写入”中看到了这一点。尽管有一些相似之处,但它们有不同的目的:版本向量可以区分两个操作是否并发或者一个操作是否因果依赖于另一个操作,而 Lamport 时间戳始终强制执行全排序。从 Lamport 时间戳的总排序中,您无法判断两个操作是否并发或是否存在因果关系。Lamport 时间戳相对于版本向量的优点是它们更紧凑。

Lamport timestamps are sometimes confused with version vectors, which we saw in “Detecting Concurrent Writes”. Although there are some similarities, they have a different purpose: version vectors can distinguish whether two operations are concurrent or whether one is causally dependent on the other, whereas Lamport timestamps always enforce a total ordering. From the total ordering of Lamport timestamps, you cannot tell whether two operations are concurrent or whether they are causally dependent. The advantage of Lamport timestamps over version vectors is that they are more compact.

时间戳排序是不够的

Timestamp ordering is not sufficient

尽管 Lamport 时间戳定义了与因果关系一致的操作总顺序,但它们还不足以解决分布式系统中的许多常见问题。

Although Lamport timestamps define a total order of operations that is consistent with causality, they are not quite sufficient to solve many common problems in distributed systems.

例如,考虑一个需要确保用户名唯一标识用户帐户的系统。如果两个用户同时尝试使用相同的用户名创建帐户,则两者之一应该成功,另一个应该失败。(我们之前在“领导者与锁”中谈到过这个问题 。)

For example, consider a system that needs to ensure that a username uniquely identifies a user account. If two users concurrently try to create an account with the same username, one of the two should succeed and the other should fail. (We touched on this problem previously in “The leader and the lock”.)

乍一看,似乎操作的总排序(例如,使用 Lamport 时间戳)应该足以解决这个问题:如果创建了两个具有相同用户名的帐户,则选择具有较低时间戳的帐户作为获胜者(先抢到用户名的人),让时间戳较大的人失败。由于时间戳是完全有序的,因此这种比较始终有效。

At first glance, it seems as though a total ordering of operations (e.g., using Lamport timestamps) should be sufficient to solve this problem: if two accounts with the same username are created, pick the one with the lower timestamp as the winner (the one who grabbed the username first), and let the one with the greater timestamp fail. Since timestamps are totally ordered, this comparison is always valid.

这种方法适用于事后确定获胜者:一旦收集齐了系统中的所有用户名创建操作,您就可以比较它们的时间戳。然而,当某个节点刚刚收到用户创建用户名的请求、需要立即决定该请求应该成功还是失败时,这是不够的。此时,该节点不知道是否有其他节点正在并发地创建具有相同用户名的帐户,也不知道其他节点可能会为该操作分配什么时间戳。

This approach works for determining the winner after the fact: once you have collected all the username creation operations in the system, you can compare their timestamps. However, it is not sufficient when a node has just received a request from a user to create a username, and needs to decide right now whether the request should succeed or fail. At that moment, the node does not know whether another node is concurrently in the process of creating an account with the same username, and what timestamp that other node may assign to the operation.

为了确保没有其他节点同时创建具有相同用户名和较低时间戳的帐户,您必须检查每个其他节点以了解它在做什么[56 ]。如果其他节点之一发生故障或由于网络问题而无法到达,则该系统将陷入瘫痪。这不是我们需要的容错系统。

In order to be sure that no other node is in the process of concurrently creating an account with the same username and a lower timestamp, you would have to check with every other node to see what it is doing [56]. If one of the other nodes has failed or cannot be reached due to a network problem, this system would grind to a halt. This is not the kind of fault-tolerant system that we need.

这里的问题是,只有在收集了所有操作之后,操作的总顺序才会出现。如果另一个节点生成了一些操作,但您还不知道它们是什么,则无法构建操作的最终顺序:来自另一个节点的未知操作可能需要插入到总顺序中的各个位置。

The problem here is that the total order of operations only emerges after you have collected all of the operations. If another node has generated some operations, but you don’t yet know what they are, you cannot construct the final ordering of operations: the unknown operations from the other node may need to be inserted at various positions in the total order.

总而言之:为了实现用户名的唯一性约束之类的功能,仅仅拥有操作的总顺序是不够的,您还需要知道该顺序何时最终确定。如果您有一个创建用户名的操作,并且您确定没有其他节点可以在总顺序中在您的操作之前插入同一用户名的声明,那么您可以安全地声明该操作成功。

To conclude: in order to implement something like a uniqueness constraint for usernames, it’s not sufficient to have a total ordering of operations—you also need to know when that order is finalized. If you have an operation to create a username, and you are sure that no other node can insert a claim for the same username ahead of your operation in the total order, then you can safely declare the operation successful.

知道全序何时最终确定的这一想法,正体现在全序广播这一主题中。

This idea of knowing when your total order is finalized is captured in the topic of total order broadcast.

全序广播

Total Order Broadcast

如果您的程序仅在单个 CPU 核心上运行,则很容易定义操作的全序:它就是 CPU 执行这些操作的顺序。然而,在分布式系统中,让所有节点就相同的操作全序达成一致是很棘手的。上一节我们讨论了按时间戳或序列号排序,但发现它不如单领导者复制强大(如果使用时间戳排序来实现唯一性约束,则不能容忍任何故障)。

If your program runs only on a single CPU core, it is easy to define a total ordering of operations: it is simply the order in which they were executed by the CPU. However, in a distributed system, getting all nodes to agree on the same total ordering of operations is tricky. In the last section we discussed ordering by timestamps or sequence numbers, but found that it is not as powerful as single-leader replication (if you use timestamp ordering to implement a uniqueness constraint, you cannot tolerate any faults).

正如前面所讨论的,单领导者复制通过选择一个节点作为领导者,并在领导者的单个 CPU 核心上对所有操作进行排序,来确定操作的全序。接下来的挑战是:如果吞吐量超过单个领导者的处理能力,如何扩展系统;以及如果领导者发生故障,如何处理故障转移(请参阅“处理节点中断”)。在分布式系统文献中,这个问题被称为全序广播或原子广播 [25, 57, 58]。ix

As discussed, single-leader replication determines a total order of operations by choosing one node as the leader and sequencing all operations on a single CPU core on the leader. The challenge then is how to scale the system if the throughput is greater than a single leader can handle, and also how to handle failover if the leader fails (see “Handling Node Outages”). In the distributed systems literature, this problem is known as total order broadcast or atomic broadcast [25, 57, 58].ix

排序保证的范围

Scope of ordering guarantee

每个分区有一个领导者的分区数据库通常只维护每个分区的排序,这意味着它们无法提供跨分区的一致性保证(例如,一致的快照、外键引用)。跨所有分区的总排序是可能的,但需要额外的协调[ 59 ]。

Partitioned databases with a single leader per partition often maintain ordering only per partition, which means they cannot offer consistency guarantees (e.g., consistent snapshots, foreign key references) across partitions. Total ordering across all partitions is possible, but requires additional coordination [59].

全序广播通常被描述为节点之间交换消息的协议。非正式地,它要求始终满足两个安全属性:

Total order broadcast is usually described as a protocol for exchanging messages between nodes. Informally, it requires that two safety properties always be satisfied:

可靠的交付
Reliable delivery

消息不会丢失:如果一条消息传递到一个节点,它就会传递到所有节点。

No messages are lost: if a message is delivered to one node, it is delivered to all nodes.

全序交付
Totally ordered delivery

消息以相同的顺序传递到每个节点。

Messages are delivered to every node in the same order.

正确的全序广播算法必须确保即使节点或网络出现故障,也始终满足可靠性和有序性。当然,当网络中断时,消息不会被传递,但是算法可以不断重试,以便当网络最终修复时消息能够通过(然后它们仍然必须以正确的顺序传递)。

A correct algorithm for total order broadcast must ensure that the reliability and ordering properties are always satisfied, even if a node or the network is faulty. Of course, messages will not be delivered while the network is interrupted, but an algorithm can keep retrying so that the messages get through when the network is eventually repaired (and then they must still be delivered in the correct order).

使用全序广播

Using total order broadcast

ZooKeeper、etcd等共识服务实际上实现了全序广播。这一事实暗示全序广播和共识之间存在紧密的联系,我们将在本章后面探讨这一点。

Consensus services such as ZooKeeper and etcd actually implement total order broadcast. This fact is a hint that there is a strong connection between total order broadcast and consensus, which we will explore later in this chapter.

全序广播正是数据库复制所需要的:如果每条消息都代表对数据库的一次写入,并且每个副本以相同的顺序处理相同的写入,那么副本之间将保持一致(除了任何临时的复制滞后)。这一原理被称为状态机复制 [60],我们将在第 11 章中再次讨论它。

Total order broadcast is exactly what you need for database replication: if every message represents a write to the database, and every replica processes the same writes in the same order, then the replicas will remain consistent with each other (aside from any temporary replication lag). This principle is known as state machine replication [60], and we will return to it in Chapter 11.

类似地,全序广播可用于实现可序列化事务:如 “实际串行执行”中所述,如果每条消息都代表一个要作为存储过程执行的确定性事务,并且如果每个节点都以相同的顺序处理这些消息,则数据库的分区和副本保持一致[ 61 ]。

Similarly, total order broadcast can be used to implement serializable transactions: as discussed in “Actual Serial Execution”, if every message represents a deterministic transaction to be executed as a stored procedure, and if every node processes those messages in the same order, then the partitions and replicas of the database are kept consistent with each other [61].

全序广播的一个重要方面是,顺序在消息传递时是固定的:如果后续消息已经传递,则不允许节点追溯地将消息插入顺序中较早的位置。这一事实使得全序广播比时间戳排序更强。

An important aspect of total order broadcast is that the order is fixed at the time the messages are delivered: a node is not allowed to retroactively insert a message into an earlier position in the order if subsequent messages have already been delivered. This fact makes total order broadcast stronger than timestamp ordering.

查看全序广播的另一种方式是,它是一种创建日志的方式(如复制日志、事务日志或预写日志):传递一条消息就像追加写入日志。由于所有节点必须以相同的顺序传递相同的消息,因此所有节点都可以读取日志并看到相同的消息序列。

Another way of looking at total order broadcast is that it is a way of creating a log (as in a replication log, transaction log, or write-ahead log): delivering a message is like appending to the log. Since all nodes must deliver the same messages in the same order, all nodes can read the log and see the same sequence of messages.

全序广播对于实现提供隔离令牌的锁定服务也很有用(请参阅“隔离令牌”)。每个获取锁的请求都会作为一条消息附加到日志中,并且所有消息都按照它们在日志中出现的顺序依次编号。然后,序列号可以用作隔离令牌,因为它是单调递增的。在ZooKeeper中,这个序列号被称为zxid [ 15 ]。

Total order broadcast is also useful for implementing a lock service that provides fencing tokens (see “Fencing tokens”). Every request to acquire the lock is appended as a message to the log, and all messages are sequentially numbered in the order they appear in the log. The sequence number can then serve as a fencing token, because it is monotonically increasing. In ZooKeeper, this sequence number is called zxid [15].

使用全序广播实现线性化存储

Implementing linearizable storage using total order broadcast

如图 9-4 所示,在可线性化系统中存在操作的全序。这是否意味着线性化与全序广播是一回事?不完全是,但两者之间有着密切的联系。x

As illustrated in Figure 9-4, in a linearizable system there is a total order of operations. Does that mean linearizability is the same as total order broadcast? Not quite, but there are close links between the two.x

全序广播是异步的:保证消息以固定顺序可靠地传递,但不能保证消息何时传递(因此一个接收者可能落后于其他接收者)。相比之下,线性化是一种新近度保证:保证读取时看到最新写入的值。

Total order broadcast is asynchronous: messages are guaranteed to be delivered reliably in a fixed order, but there is no guarantee about when a message will be delivered (so one recipient may lag behind the others). By contrast, linearizability is a recency guarantee: a read is guaranteed to see the latest value written.

但是,如果您有全序广播,则可以在其之上构建线性化存储。例如,您可以确保用户名唯一标识用户帐户。

However, if you have total order broadcast, you can build linearizable storage on top of it. For example, you can ensure that usernames uniquely identify user accounts.

想象一下,对于每个可能的用户名,您都有一个带有原子比较并设置操作的可线性化寄存器。每个寄存器最初的值都是 null(表示该用户名未被占用)。当用户想要创建一个用户名时,您对该用户名的寄存器执行比较并设置操作,在先前寄存器值为 null 的条件下将其设置为用户帐户 ID。如果多个用户尝试同时抢占相同的用户名,则只有一个比较并设置操作会成功,因为其他用户将看到一个非 null 的值(由于线性化)。

Imagine that for every possible username, you can have a linearizable register with an atomic compare-and-set operation. Every register initially has the value null (indicating that the username is not taken). When a user wants to create a username, you execute a compare-and-set operation on the register for that username, setting it to the user account ID, under the condition that the previous register value is null. If multiple users try to concurrently grab the same username, only one of the compare-and-set operations will succeed, because the others will see a value other than null (due to linearizability).

您可以通过使用全序广播作为仅附加日志 [ 62 , 63 ] 来实现这样的线性化比较和设置操作,如下所示:

You can implement such a linearizable compare-and-set operation as follows by using total order broadcast as an append-only log [62, 63]:

  1. 将一条消息附加到日志中,暂时指明您想要声明的用户名。

  1. Append a message to the log, tentatively indicating the username you want to claim.

  2. 读取日志,等待您附加的消息被传回给您。

  2. Read the log, and wait for the message you appended to be delivered back to you.xi

  3. 检查是否有任何声明您想要的用户名的消息。如果关于您所需用户名的第一条消息就是您自己的消息,那么您就成功了:您可以提交用户名声明(也许通过将另一条消息附加到日志中)并向客户端确认。如果关于您所需用户名的第一条消息来自其他用户,则中止该操作。

  3. Check for any messages claiming the username that you want. If the first message for your desired username is your own message, then you are successful: you can commit the username claim (perhaps by appending another message to the log) and acknowledge it to the client. If the first message for your desired username is from another user, you abort the operation.

因为日志条目以相同的顺序传递到所有节点,所以如果有多个并发写入,所有节点都会就哪个先发生达成一致。选择第一个冲突的写入作为获胜者并中止后面的写入可确保所有节点就写入是提交还是中止达成一致。类似的方法可用于在日志[ 62 ]之上实现可序列化的多对象事务。

Because log entries are delivered to all nodes in the same order, if there are several concurrent writes, all nodes will agree on which one came first. Choosing the first of the conflicting writes as the winner and aborting later ones ensures that all nodes agree on whether a write was committed or aborted. A similar approach can be used to implement serializable multi-object transactions on top of a log [62].
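下面是这一用户名声明算法的示意性草图(并非书中代码;此处用一个共享的内存列表来模拟全序广播日志,函数名均为本例的假设):

An illustrative sketch of this username-claim algorithm (not from the book; the total order broadcast log is simulated here by a shared in-memory list, and the function names are assumptions of this example):

```python
import itertools

log = []                     # stands in for the total order broadcast log
msg_ids = itertools.count()  # unique IDs for appended messages

def claim_username(username, account_id):
    # Step 1: tentatively append a claim to the log.
    my_id = next(msg_ids)
    log.append({"id": my_id, "username": username, "account": account_id})
    # Steps 2-3: read the log back (every node sees the same order) and
    # check whether our claim is the first one for this username.
    for entry in log:
        if entry["username"] == username:
            return entry["id"] == my_id   # first claim in the log wins
    return False

assert claim_username("alice", "account-1") is True
assert claim_username("alice", "account-2") is False  # not first in the log
assert claim_username("bob", "account-3") is True
```

因为所有节点以相同顺序读取同一份日志,所以每个节点都会对“谁先声明”得出同样的结论,无需额外协调。Because every node reads the same log in the same order, every node reaches the same verdict about which claim came first, with no extra coordination.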

虽然此过程可确保线性化写入,但它不能保证线性化读取:如果您从一个根据日志异步更新的存储中读取数据,该数据可能已经过时。(准确地说,这里描述的过程提供的是顺序一致性 [47, 64],有时也称为时间线一致性 [65, 66],这是比线性化稍弱的保证。)为了使读取线性化,有以下几个选项:

While this procedure ensures linearizable writes, it doesn’t guarantee linearizable reads—if you read from a store that is asynchronously updated from the log, it may be stale. (To be precise, the procedure described here provides sequential consistency [47, 64], sometimes also known as timeline consistency [65, 66], a slightly weaker guarantee than linearizability.) To make reads linearizable, there are a few options:

  • 您可以通过追加一条消息、读取日志、并在该消息被传回给您时执行实际读取,从而让读取按日志排序。消息在日志中的位置因此定义了读取发生的时间点。(etcd 中的法定人数读取的工作方式与此有些类似 [16]。)

  • You can sequence reads through the log by appending a message, reading the log, and performing the actual read when the message is delivered back to you. The message’s position in the log thus defines the point in time at which the read happens. (Quorum reads in etcd work somewhat like this [16].)

  • 如果日志允许您以可线性化的方式获取最新日志消息的位置,您可以查询该位置,等待直到该位置为止的所有条目都传递给您,然后执行读取。(这是 ZooKeeper 的 sync() 操作背后的想法 [15]。)

  • If the log allows you to fetch the position of the latest log message in a linearizable way, you can query that position, wait for all entries up to that position to be delivered to you, and then perform the read. (This is the idea behind ZooKeeper’s sync() operation [15].)

  • 您可以从在写入时同步更新的副本进行读取,因此确保是最新的。(该技术用于链式复制[ 63 ];另见“复制研究”。)

  • You can make your read from a replica that is synchronously updated on writes, and is thus sure to be up to date. (This technique is used in chain replication [63]; see also “Research on Replication”.)

使用线性化存储实现全序广播

Implementing total order broadcast using linearizable storage

最后一节展示了如何从全序广播构建线性化比较和设置操作。我们还可以扭转局面,假设我们有线性化存储,并展示如何从中构建全序广播。

The last section showed how to build a linearizable compare-and-set operation from total order broadcast. We can also turn it around, assume that we have linearizable storage, and show how to build total order broadcast from it.

最简单的方法是假设您有一个可线性化的寄存器,它存储一个整数并且具有原子增量和获取操作[ 28 ]。或者,原子比较和设置操作也可以完成这项工作。

The easiest way is to assume you have a linearizable register that stores an integer and that has an atomic increment-and-get operation [28]. Alternatively, an atomic compare-and-set operation would also do the job.

该算法很简单:对于要通过全序广播发送的每条消息,您递增并获取可线性化的整数,然后将从寄存器获得的值作为序列号附加到消息中。然后,您可以将消息发送到所有节点(重新发送任何丢失的消息),接收者将按序列号连续传递消息。

The algorithm is simple: for every message you want to send through total order broadcast, you increment-and-get the linearizable integer, and then attach the value you got from the register as a sequence number to the message. You can then send the message to all nodes (resending any lost messages), and the recipients will deliver the messages consecutively by sequence number.

请注意,与 Lamport 时间戳不同,通过递增线性化寄存器获得的数字形成一个没有间隙的序列。因此,如果一个节点已经传递了消息 4 并收到了序列号为 6 的传入消息,则它知道它必须等待消息 5 才能传递消息 6。但 Lamport 时间戳的情况并非如此,事实上,这是全序广播和时间戳排序之间的主要区别。

Note that unlike Lamport timestamps, the numbers you get from incrementing the linearizable register form a sequence with no gaps. Thus, if a node has delivered message 4 and receives an incoming message with a sequence number of 6, it knows that it must wait for message 5 before it can deliver message 6. The same is not the case with Lamport timestamps—in fact, this is the key difference between total order broadcast and timestamp ordering.
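这一构造可以用如下草图来说明(并非书中代码;用线程锁模拟可线性化寄存器、用内存对象模拟接收者,均为本例的假设):

This construction can be illustrated with the following sketch (not from the book; simulating the linearizable register with a thread lock and the recipient with an in-memory object are assumptions of this example):

```python
import threading

class LinearizableCounter:
    """Stands in for a fault-tolerant linearizable register."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value

class Recipient:
    def __init__(self):
        self.next_seq = 1
        self.buffer = {}       # out-of-order messages, keyed by seq number
        self.delivered = []

    def receive(self, seq, msg):
        self.buffer[seq] = msg
        # Deliver strictly by sequence number: because the numbers have
        # no gaps, message 6 waits until message 5 has arrived.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

counter = LinearizableCounter()
r = Recipient()
m1 = (counter.increment_and_get(), "first")    # stamped with seq 1
m2 = (counter.increment_and_get(), "second")   # stamped with seq 2
r.receive(*m2)              # arrives out of order
assert r.delivered == []    # held back: still waiting for seq 1
r.receive(*m1)
assert r.delivered == ["first", "second"]
```

正如正文接下来所指出的,困难全部隐藏在 LinearizableCounter 里:在存在故障的情况下实现它,本质上就是共识问题。As the text goes on to point out, all the difficulty is hidden inside LinearizableCounter: implementing it in the presence of faults is essentially the consensus problem.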

使用原子增量和获取操作生成可线性化整数有多难?像往常一样,如果事情从未失败,那就很容易了:您可以将其保存在一个节点上的变量中。问题在于处理该节点的网络连接中断时的情况,并在该节点发生故障时恢复该值[ 59 ]。一般来说,如果你对线性化序列号生成器进行了足够的思考,你不可避免地会得到一个共识算法。

How hard could it be to make a linearizable integer with an atomic increment-and-get operation? As usual, if things never failed, it would be easy: you could just keep it in a variable on one node. The problem lies in handling the situation when network connections to that node are interrupted, and restoring the value when that node fails [59]. In general, if you think hard enough about linearizable sequence number generators, you inevitably end up with a consensus algorithm.

这并非巧合:可以证明线性化的比较和设置(或增量和获取)寄存器和全序广播都等同于共识 [ 28 , 67 ]。也就是说,如果你能解决其中一个问题,你就可以将其转化为其他问题的解决方案。这是一个相当深刻和令人惊讶的见解!

This is no coincidence: it can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus [28, 67]. That is, if you can solve one of these problems, you can transform it into a solution for the others. This is quite a profound and surprising insight!

现在是时候最终正面解决共识问题了,我们将在本章的其余部分中这样做。

It is time to finally tackle the consensus problem head-on, which we will do in the rest of this chapter.

分布式事务与共识

Distributed Transactions and Consensus

共识是分布式计算中最重要、最基础的问题之一。表面上看,它似乎很简单:非正式地说,目标只是让几个节点就某件事达成一致。您可能认为这应该不会太难。不幸的是,许多有问题的系统正是在“这个问题很容易解决”的错误信念下构建出来的。

Consensus is one of the most important and fundamental problems in distributed computing. On the surface, it seems simple: informally, the goal is simply to get several nodes to agree on something. You might think that this shouldn’t be too hard. Unfortunately, many broken systems have been built in the mistaken belief that this problem is easy to solve.

尽管共识非常重要,但有关它的部分出现在本书的后面,因为该主题相当微妙,而理解其中的微妙之处需要一些先决知识。即使在学术研究界,对共识的理解也是经过几十年的时间才逐渐具体化的,过程中也存在许多误解。现在我们已经讨论了复制(第 5 章)、事务(第 7 章)、系统模型(第 8 章)、线性化和全序广播(本章),我们终于准备好解决共识问题了。

Although consensus is very important, the section about it appears late in this book because the topic is quite subtle, and appreciating the subtleties requires some prerequisite knowledge. Even in the academic research community, the understanding of consensus only gradually crystallized over the course of decades, with many misunderstandings along the way. Now that we have discussed replication (Chapter 5), transactions (Chapter 7), system models (Chapter 8), linearizability, and total order broadcast (this chapter), we are finally ready to tackle the consensus problem.

在许多情况下,节点达成一致非常重要。例如:

There are a number of situations in which it is important for nodes to agree. For example:

领导人选举
Leader election

在具有单领导者复制的数据库中,所有节点需要就哪个节点是领导者达成一致。如果某些节点由于网络故障而无法与其他节点通信,则领导权可能会出现争夺。在这种情况下,共识对于避免错误的故障转移非常重要:错误的故障转移会导致两个节点都认为自己是领导者的脑裂情况(请参阅 "处理节点中断")。如果有两个领导者,它们都会接受写入,并且它们的数据会出现分歧,从而导致不一致和数据丢失。

In a database with single-leader replication, all nodes need to agree on which node is the leader. The leadership position might become contested if some nodes can’t communicate with others due to a network fault. In this case, consensus is important to avoid a bad failover, resulting in a split brain situation in which two nodes both believe themselves to be the leader (see “Handling Node Outages”). If there were two leaders, they would both accept writes and their data would diverge, leading to inconsistency and data loss.

原子提交
Atomic commit

在支持跨多个节点或分区的事务的数据库中,我们遇到的问题是事务可能在某些节点上失败,但在其他节点上成功。如果我们想维护事务原子性(在 ACID 的意义上;参见"原子性"),我们必须让所有节点就事务的结果达成一致:要么它们全部中止/回滚(如果出现任何问题),要么它们全部提交(如果没有出问题)。这种共识实例称为 原子提交问题。xii

In a database that supports transactions spanning several nodes or partitions, we have the problem that a transaction may fail on some nodes but succeed on others. If we want to maintain transaction atomicity (in the sense of ACID; see “Atomicity”), we have to get all nodes to agree on the outcome of the transaction: either they all abort/roll back (if anything goes wrong) or they all commit (if nothing goes wrong). This instance of consensus is known as the atomic commit problem.xii

在本节中,我们将首先更详细地研究原子提交问题。特别是,我们将讨论两阶段提交(2PC)算法,它是解决原子提交的最常见方法,并且在各种数据库、消息传递系统和应用程序服务器中实现。事实证明,2PC 是一种共识算法,但不是一个很好的算法 [ 70 , 71 ]。

In this section we will first examine the atomic commit problem in more detail. In particular, we will discuss the two-phase commit (2PC) algorithm, which is the most common way of solving atomic commit and which is implemented in various databases, messaging systems, and application servers. It turns out that 2PC is a kind of consensus algorithm—but not a very good one [70, 71].

通过向 2PC 学习,我们将致力于开发更好的共识算法,例如 ZooKeeper (Zab) 和 etcd (Raft) 中使用的算法。

By learning from 2PC we will then work our way toward better consensus algorithms, such as those used in ZooKeeper (Zab) and etcd (Raft).

原子提交和两阶段提交(2PC)

Atomic Commit and Two-Phase Commit (2PC)

第 7 章中,我们了解到事务原子性的目的是在多次写入过程中出现问题时提供简单的语义。事务的结果要么是成功提交,在这种情况下事务的所有写入都变得持久,要么是中止在这种情况下事务的所有写入都被回滚(即撤消或丢弃)。

In Chapter 7 we learned that the purpose of transaction atomicity is to provide simple semantics in the case where something goes wrong in the middle of making several writes. The outcome of a transaction is either a successful commit, in which case all of the transaction’s writes are made durable, or an abort, in which case all of the transaction’s writes are rolled back (i.e., undone or discarded).

原子性可以防止失败的事务将半完成的结果和半更新的状态散落在数据库中。这对于多对象事务(参见 “单对象和多对象操作”)和维护二级索引的数据库尤其重要。每个二级索引都是独立于主数据的数据结构,因此,如果修改某些数据,二级索引也需要进行相应的更改。原子性确保辅助索引与主数据保持一致(如果索引与主数据不一致,则不会很有用)。

Atomicity prevents failed transactions from littering the database with half-finished results and half-updated state. This is especially important for multi-object transactions (see “Single-Object and Multi-Object Operations”) and databases that maintain secondary indexes. Each secondary index is a separate data structure from the primary data—thus, if you modify some data, the corresponding change needs to also be made in the secondary index. Atomicity ensures that the secondary index stays consistent with the primary data (if the index became inconsistent with the primary data, it would not be very useful).

从单节点到分布式原子提交

From single-node to distributed atomic commit

对于在单个数据库节点上执行的事务,原子性通常由存储引擎实现。当客户端请求数据库节点提交事务时,数据库使事务的写入持久化(通常在预写日志中;请参阅“使 B 树可靠”),然后将提交记录附加到磁盘上的日志中。如果数据库在此过程中崩溃,则当节点重新启动时,该事务将从日志中恢复:如果崩溃前提交记录已成功写入磁盘,则该事务被视为已提交;如果不是,则回滚该事务的任何写入。

For transactions that execute at a single database node, atomicity is commonly implemented by the storage engine. When the client asks the database node to commit the transaction, the database makes the transaction’s writes durable (typically in a write-ahead log; see “Making B-trees reliable”) and then appends a commit record to the log on disk. If the database crashes in the middle of this process, the transaction is recovered from the log when the node restarts: if the commit record was successfully written to disk before the crash, the transaction is considered committed; if not, any writes from that transaction are rolled back.

因此,在单个节点上,事务提交关键取决于数据持久写入磁盘的顺序:首先是数据,然后是提交记录[ 72 ]。事务提交还是中止的关键决定时刻是磁盘写完提交记录的时刻:在那一刻之前,仍然有可能中止(由于崩溃),但在那一刻之后,事务就被提交了(即使数据库崩溃)。因此,它是使提交原子化的单个设备(连接到一个特定节点的一个特定磁盘驱动器的控制器)。

Thus, on a single node, transaction commitment crucially depends on the order in which data is durably written to disk: first the data, then the commit record [72]. The key deciding moment for whether the transaction commits or aborts is the moment at which the disk finishes writing the commit record: before that moment, it is still possible to abort (due to a crash), but after that moment, the transaction is committed (even if the database crashes). Thus, it is a single device (the controller of one particular disk drive, attached to one particular node) that makes the commit atomic.
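The ordering described above can be sketched in a few lines. This is a toy model, not any real storage engine: the `Log`, `commit`, and `recover` names are illustrative, and a Python list stands in for durable, ordered disk writes.

```python
# Sketch of single-node atomic commit via a write-ahead log:
# data records first, then the commit record.

class Log:
    def __init__(self):
        self.records = []  # stands in for durable, ordered writes to disk

    def append(self, record):
        self.records.append(record)  # a real engine would fsync here

def commit(log, txid, writes):
    for key, value in writes:
        log.append(("data", txid, key, value))
    # The commit record is the single deciding event: a crash before
    # this append aborts the transaction, a crash after it commits.
    log.append(("commit", txid))

def recover(log):
    """Replay the log: only transactions with a commit record survive."""
    committed = {r[1] for r in log.records if r[0] == "commit"}
    state = {}
    for r in log.records:
        if r[0] == "data" and r[1] in committed:
            state[r[2]] = r[3]
    return state
```

Running recovery against a log where one transaction crashed before its commit record illustrates the rule: the half-finished transaction's writes are simply discarded.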

但是,如果一个事务涉及多个节点怎么办?例如,也许您在分区数据库中有一个多对象事务,或者有一个按词条分区的二级索引(其中索引条目可能位于与主数据不同的节点上;请参阅"分区和二级索引")。大多数"NoSQL"分布式数据存储不支持此类分布式事务,但各种集群关系系统支持(参见 "实践中的分布式事务")。

However, what if multiple nodes are involved in a transaction? For example, perhaps you have a multi-object transaction in a partitioned database, or a term-partitioned secondary index (in which the index entry may be on a different node from the primary data; see “Partitioning and Secondary Indexes”). Most “NoSQL” distributed datastores do not support such distributed transactions, but various clustered relational systems do (see “Distributed Transactions in Practice”).

在这些情况下,简单地向所有节点发送提交请求并在每个节点上独立提交事务是不够的。这样做时,很容易发生提交在某些节点上成功而在其他节点上失败的情况,这将违反原子性保证:

In these cases, it is not sufficient to simply send a commit request to all of the nodes and independently commit the transaction on each one. In doing so, it could easily happen that the commit succeeds on some nodes and fails on other nodes, which would violate the atomicity guarantee:

  • 某些节点可能会检测到约束违规或冲突,从而需要中止,而其他节点则能够成功提交。

  • Some nodes may detect a constraint violation or conflict, making an abort necessary, while other nodes are successfully able to commit.

  • 一些提交请求可能会在网络中丢失,最终由于超时而中止,而其他提交请求则可以通过。

  • Some of the commit requests might be lost in the network, eventually aborting due to a timeout, while other commit requests get through.

  • 某些节点可能会在提交记录完全写入并在恢复时回滚之前崩溃,而其他节点则成功提交。

  • Some nodes may crash before the commit record is fully written and roll back on recovery, while others successfully commit.

如果一些节点提交事务但其他节点中止事务,则节点之间会变得不一致(如图7-3 所示)。一旦事务在一个节点上提交,如果后来发现它在另一个节点上中止,则无法再次撤回该事务。因此,只有在确定事务中的所有其他节点也将提交时,节点才必须提交。

If some nodes commit the transaction but others abort it, the nodes become inconsistent with each other (like in Figure 7-3). And once a transaction has been committed on one node, it cannot be retracted again if it later turns out that it was aborted on another node. For this reason, a node must only commit once it is certain that all other nodes in the transaction are also going to commit.

事务提交必须是不可撤销的——在提交事务后,您不得改变主意并追溯中止事务。这条规则的原因是,一旦数据被提交,它就对其他事务可见,因此其他客户端可能开始依赖该数据;该原则构成了读已提交隔离的基础,在 “读已提交”中进行了讨论。如果允许事务在提交后中止,则任何读取已提交数据的事务都将基于追溯声明不存在的数据,因此它们也必须还原。

A transaction commit must be irrevocable—you are not allowed to change your mind and retroactively abort a transaction after it has been committed. The reason for this rule is that once data has been committed, it becomes visible to other transactions, and thus other clients may start relying on that data; this principle forms the basis of read committed isolation, discussed in “Read Committed”. If a transaction was allowed to abort after committing, any transactions that read the committed data would be based on data that was retroactively declared not to have existed—so they would have to be reverted as well.

(已提交事务的影响稍后可能会被另一个 补偿事务撤销[ 73 , 74 ]。但是,从数据库的角度来看,这是一个单独的事务,因此任何跨事务的正确性要求都是应用程序的问题。)

(It is possible for the effects of a committed transaction to later be undone by another, compensating transaction [73, 74]. However, from the database’s point of view this is a separate transaction, and thus any cross-transaction correctness requirements are the application’s problem.)

两阶段提交简介

Introduction to two-phase commit

两阶段提交是一种用于实现跨多个节点的原子事务提交的算法,即确保所有节点都提交或所有节点都中止。它是分布式数据库中的经典算法 [ 13, 35, 75 ]。2PC 在某些数据库内部使用,也以 XA 事务 [ 76, 77 ](例如,由 Java Transaction API 支持)的形式,或通过用于 SOAP Web 服务的 WS-AtomicTransaction [ 78, 79 ] 的形式,提供给应用程序使用。

Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes—i.e., to ensure that either all nodes commit or all nodes abort. It is a classic algorithm in distributed databases [13, 35, 75]. 2PC is used internally in some databases and also made available to applications in the form of XA transactions [76, 77] (which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP web services [78, 79].

2PC的基本流程如图9-9所示。与单节点事务不同,2PC 中的提交/中止过程分为两个阶段(因此得名),而不是单个提交请求。

The basic flow of 2PC is illustrated in Figure 9-9. Instead of a single commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two phases (hence the name).

图 9-9。成功执行两阶段提交(2PC)。
Figure 9-9. A successful execution of two-phase commit (2PC).

不要混淆 2PC 和 2PL

Don’t confuse 2PC and 2PL

两阶段提交(2PC)和两阶段锁定(参见“两阶段锁定(2PL)”)是两个截然不同的东西。2PC 在分布式数据库中提供原子提交,而 2PL 提供可序列化隔离。为了避免混淆,最好将它们视为完全独立的概念,并忽略名称中不幸的相似之处。

Two-phase commit (2PC) and two-phase locking (see “Two-Phase Locking (2PL)”) are two very different things. 2PC provides atomic commit in a distributed database, whereas 2PL provides serializable isolation. To avoid confusion, it’s best to think of them as entirely separate concepts and to ignore the unfortunate similarity in the names.

2PC 使用了一个通常不会出现在单节点事务中的新组件: 协调器(也称为事务管理器)。协调器通常作为请求事务的同一应用程序进程中的库来实现(例如,嵌入在Java EE 容器中),但它也可以是单独的进程或服务。此类协调员的示例包括 Narayana、JOTM、BTM 或 MSDTC。

2PC uses a new component that does not normally appear in single-node transactions: a coordinator (also known as transaction manager). The coordinator is often implemented as a library within the same application process that is requesting the transaction (e.g., embedded in a Java EE container), but it can also be a separate process or service. Examples of such coordinators include Narayana, JOTM, BTM, or MSDTC.

像平常一样,2PC 事务从应用程序在多个数据库节点上读取和写入数据开始。我们称这些数据库节点为事务的参与者。当应用程序准备好提交时,协调器开始第 1 阶段:它向每个节点发送准备请求,询问它们是否能够提交。然后协调员跟踪参与者的响应:

A 2PC transaction begins with the application reading and writing data on multiple database nodes, as normal. We call these database nodes participants in the transaction. When the application is ready to commit, the coordinator begins phase 1: it sends a prepare request to each of the nodes, asking them whether they are able to commit. The coordinator then tracks the responses from the participants:

  • 如果所有参与者都回答“是”,表明他们已准备好提交,则协调者在第 2 阶段发出提交请求,并且提交实际上发生。

  • If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends out a commit request in phase 2, and the commit actually takes place.

  • 如果任何参与者回答“否”,协调器将向第 2 阶段的所有节点发送中止请求。

  • If any of the participants replies “no,” the coordinator sends an abort request to all nodes in phase 2.
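The two phases above can be sketched as a toy coordinator loop. The `Participant` class and its `prepare`/`commit`/`abort` methods are assumptions made for this sketch (no failure handling, no durable log), not a real API:

```python
# Toy 2PC coordinator: phase 1 collects votes, phase 2 enforces
# the unanimous decision on every participant.

def two_phase_commit(participants):
    # Phase 1: ask every participant whether it can commit.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: send the decision to all participants.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"
```

A single "no" vote (for example, a constraint violation on one node) is enough to abort the transaction everywhere, which is exactly the all-or-nothing property 2PC is meant to provide.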

这个过程有点像西方文化中的传统婚礼:牧师分别询问新娘和新郎是否愿意嫁给对方,通常都会得到双方的回答"我愿意"。收到双方的确认后,牧师宣布这对新人结为夫妻:事务已提交,并向所有来宾广播这一喜讯。如果新娘或新郎不说"是",仪式就会中止 [ 73 ]。

This process is somewhat like the traditional marriage ceremony in Western cultures: the minister asks the bride and groom individually whether each wants to marry the other, and typically receives the answer “I do” from both. After receiving both acknowledgments, the minister pronounces the couple husband and wife: the transaction is committed, and the happy fact is broadcast to all attendees. If either bride or groom does not say “yes,” the ceremony is aborted [73].

承诺体系

A system of promises

从这个简短的描述中可能不清楚为什么两阶段提交可以确保原子性,而跨多个节点的一阶段提交则不能。当然,在两阶段情况下,准备和提交请求也很容易丢失。2PC 有何不同?

From this short description it might not be clear why two-phase commit ensures atomicity, while one-phase commit across several nodes does not. Surely the prepare and commit requests can just as easily be lost in the two-phase case. What makes 2PC different?

为了理解它为什么起作用,我们必须更详细地分解这个过程:

To understand why it works, we have to break down the process in a bit more detail:

  1. 当应用程序想要开始分布式事务时,它会向协调器请求一个事务 ID。这个事务 ID 是全局唯一的。

  1. When the application wants to begin a distributed transaction, it requests a transaction ID from the coordinator. This transaction ID is globally unique.

  2. 应用程序在每个参与者上开始单节点事务,并将全局唯一的事务 ID 附加到单节点事务上。所有读取和写入均在这些单节点事务之一中完成。如果在此阶段出现任何问题(例如,节点崩溃或请求超时),协调者或任何参与者都可以中止。

  2. The application begins a single-node transaction on each of the participants, and attaches the globally unique transaction ID to the single-node transaction. All reads and writes are done in one of these single-node transactions. If anything goes wrong at this stage (for example, a node crashes or a request times out), the coordinator or any of the participants can abort.

  3. 当应用程序准备好提交时,协调器向所有参与者发送一个准备请求,并标记有全局事务 ID。如果这些请求中的任何一个失败或超时,协调器都会向所有参与者发送该事务 ID 的中止请求。

  3. When the application is ready to commit, the coordinator sends a prepare request to all participants, tagged with the global transaction ID. If any of these requests fails or times out, the coordinator sends an abort request for that transaction ID to all participants.

  4. 当参与者收到准备请求时,它确保在任何情况下都可以肯定地提交事务。这包括将所有事务数据写入磁盘(崩溃、电源故障或磁盘空间不足都不是拒绝稍后提交的可接受借口),并检查是否有任何冲突或约束违规。通过向协调器回复"是",节点承诺在收到请求时无错误地提交事务。换句话说,参与者放弃了中止事务的权利,但没有实际提交事务。

  4. When a participant receives the prepare request, it makes sure that it can definitely commit the transaction under all circumstances. This includes writing all transaction data to disk (a crash, a power failure, or running out of disk space is not an acceptable excuse for refusing to commit later), and checking for any conflicts or constraint violations. By replying “yes” to the coordinator, the node promises to commit the transaction without error if requested. In other words, the participant surrenders the right to abort the transaction, but without actually committing it.

  5. 当协调器收到对所有准备请求的响应时,它会做出是否提交或中止事务的明确决定(仅当所有参与者都投票"是"时才提交)。协调器必须将该决定写入磁盘上的事务日志,以便在随后崩溃时知道自己做出了哪种决定。这称为提交点。

  5. When the coordinator has received responses to all prepare requests, it makes a definitive decision on whether to commit or abort the transaction (committing only if all participants voted “yes”). The coordinator must write that decision to its transaction log on disk so that it knows which way it decided in case it subsequently crashes. This is called the commit point.

  6. 一旦协调者的决定被写入磁盘,提交或中止请求就会发送给所有参与者。如果此请求失败或超时,协调器必须永远重试,直到成功。不再有回头路:如果决定是提交,则无论需要重试多少次,都必须执行该决定。如果参与者在此期间崩溃了,事务将在其恢复时提交,因为参与者投了"是"票,它不能在恢复时拒绝提交。

  6. Once the coordinator’s decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds. There is no more going back: if the decision was to commit, that decision must be enforced, no matter how many retries it takes. If a participant has crashed in the meantime, the transaction will be committed when it recovers—since the participant voted “yes,” it cannot refuse to commit when it recovers.

因此,该协议包含两个关键的“不归点”:当参与者投票“是”时,它承诺稍后一定能够提交(尽管协调者仍然可能选择中止);一旦协调员做出决定,该决定就不可撤销。这些承诺确保了 2PC 的原子性。(单节点原子提交将这两个事件合二为一:将提交记录写入事务日志。)

Thus, the protocol contains two crucial “points of no return”: when a participant votes “yes,” it promises that it will definitely be able to commit later (although the coordinator may still choose to abort); and once the coordinator decides, that decision is irrevocable. Those promises ensure the atomicity of 2PC. (Single-node atomic commit lumps these two events into one: writing the commit record to the transaction log.)

回到婚姻的类比,在说"我愿意"之前,您和您的新娘/新郎可以自由地通过说"不行!"(或类似的话)来中止事务。然而,在说了"我愿意"之后,你就不能撤回该声明。如果您在说"我愿意"后晕倒,并且没有听到牧师说"你们现在结为夫妻",这并不能改变事务已提交的事实。当你稍后恢复意识时,你可以通过向牧师查询你的全局事务 ID 的状态来了解自己是否已婚,或者你可以等待牧师下一次重试提交请求(因为在你失去意识的整个期间,重试会一直持续进行)。

Returning to the marriage analogy, before saying “I do,” you and your bride/groom have the freedom to abort the transaction by saying “No way!” (or something to that effect). However, after saying “I do,” you cannot retract that statement. If you faint after saying “I do” and you don’t hear the minister speak the words “You are now husband and wife,” that doesn’t change the fact that the transaction was committed. When you recover consciousness later, you can find out whether you are married or not by querying the minister for the status of your global transaction ID, or you can wait for the minister’s next retry of the commit request (since the retries will have continued throughout your period of unconsciousness).

协调器故障

Coordinator failure

我们已经讨论过如果在 2PC 期间参与者之一或网络失败会发生什么:如果任何准备请求失败或超时,协调器将中止事务;如果任何提交或中止请求失败,协调器将无限期地重试。然而,尚不清楚如果协调器崩溃会发生什么。

We have discussed what happens if one of the participants or the network fails during 2PC: if any of the prepare requests fail or time out, the coordinator aborts the transaction; if any of the commit or abort requests fail, the coordinator retries them indefinitely. However, it is less clear what happens if the coordinator crashes.

如果协调者在发送准备请求之前失败,参与者可以安全地中止事务。但是,一旦参与者收到准备请求并投了"是"票,它就不能再单方面中止,它必须等待协调者的回复,以确定事务是已提交还是已中止。如果此时协调者崩溃或者网络出现故障,参与者只能等待。参与者处于这种状态的事务被称为有疑问的或不确定的。

If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction. But once the participant has received a prepare request and voted “yes,” it can no longer abort unilaterally—it must wait to hear back from the coordinator whether the transaction was committed or aborted. If the coordinator crashes or the network fails at this point, the participant can do nothing but wait. A participant’s transaction in this state is called in doubt or uncertain.

其情况如图9-10所示。在这个特定的示例中,协调器实际上决定提交,并且数据库 2 收到了提交请求。但是,协调器在向数据库 1 发送提交请求之前崩溃了,因此数据库 1 不知道是提交还是中止。即使超时也无济于事:如果数据库 1 在超时后单方面中止,则最终将与已提交的数据库 2 不一致。同样,单方面提交也不安全,因为另一个参与者可能已经中止。

The situation is illustrated in Figure 9-10. In this particular example, the coordinator actually decided to commit, and database 2 received the commit request. However, the coordinator crashed before it could send the commit request to database 1, and so database 1 does not know whether to commit or abort. Even a timeout does not help here: if database 1 unilaterally aborts after a timeout, it will end up inconsistent with database 2, which has committed. Similarly, it is not safe to unilaterally commit, because another participant may have aborted.

图 9-10。参与者投票"同意"后,协调器崩溃。数据库 1 不知道是提交还是中止。
Figure 9-10. The coordinator crashes after participants vote “yes.” Database 1 does not know whether to commit or abort.

如果没有协调员的消息,参与者就无法知道是否要提交或中止。原则上,参与者可以相互通信,以了解每个参与者如何投票并达成某种协议,但这不是 2PC 协议的一部分。

Without hearing from the coordinator, the participant has no way of knowing whether to commit or abort. In principle, the participants could communicate among themselves to find out how each participant voted and come to some agreement, but that is not part of the 2PC protocol.

2PC 完成的唯一方法是等待协调器恢复。这就是为什么协调器必须在向参与者发送提交或中止请求之前将其提交或中止决策写入磁盘上的事务日志的原因:当协调器恢复时,它通过读取其事务日志来确定所有可疑事务的状态。协调器日志中没有提交记录的任何事务都将被中止。因此,2PC 的提交点归结为协调器上的常规单节点原子提交。

The only way 2PC can complete is by waiting for the coordinator to recover. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants: when the coordinator recovers, it determines the status of all in-doubt transactions by reading its transaction log. Any transactions that don’t have a commit record in the coordinator’s log are aborted. Thus, the commit point of 2PC comes down to a regular single-node atomic commit on the coordinator.
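This recovery rule can be sketched as follows, with a simple in-memory list standing in for the coordinator's durable decision log (illustrative names, not a real implementation):

```python
# The coordinator logs its decision before phase 2; on recovery,
# any in-doubt transaction with no commit record is aborted.

def decide_and_log(log, txid, votes):
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    log.append((txid, decision))  # the commit point: forced to disk
    return decision

def recover_in_doubt(log, in_doubt_txids):
    """Resolve in-doubt transactions from the durable decision log."""
    decisions = dict(log)
    # No record in the log means the transaction never passed its
    # commit point, so it is treated as aborted.
    return {t: decisions.get(t, "abort") for t in in_doubt_txids}
```

Note that the default is "abort": a coordinator that crashed between receiving the votes and writing its decision has, by definition, not committed.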

三阶段提交

Three-phase commit

两阶段提交被称为阻塞原子提交协议,因为 2PC 可能会卡住等待协调器恢复。理论上,可以使原子提交协议成为非阻塞的,这样在节点发生故障时它就不会被卡住。然而,在实践中实现这项工作并不那么简单。

Two-phase commit is called a blocking atomic commit protocol due to the fact that 2PC can become stuck waiting for the coordinator to recover. In theory, it is possible to make an atomic commit protocol nonblocking, so that it does not get stuck if a node fails. However, making this work in practice is not so straightforward.

作为 2PC 的替代方案,人们提出了一种称为三阶段提交(3PC)的算法 [ 13, 80 ]。然而,3PC 假设网络延迟有界且节点响应时间有界;在大多数具有无界网络延迟和进程暂停的实际系统中(参见第 8 章),它不能保证原子性。

As an alternative to 2PC, an algorithm called three-phase commit (3PC) has been proposed [13, 80]. However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most practical systems with unbounded network delay and process pauses (see Chapter 8), it cannot guarantee atomicity.

一般来说,非阻塞原子提交需要一个完美的故障检测器 [ 67 , 71 ]——即判断节点是否崩溃的可靠机制。在具有无限延迟的网络中,超时并不是可靠的故障检测器,因为即使没有节点崩溃,请求也可能由于网络问题而超时。因此,尽管存在已知的协调器故障问题,但仍继续使用 2PC。

In general, nonblocking atomic commit requires a perfect failure detector [67, 71]—i.e., a reliable mechanism for telling whether a node has crashed or not. In a network with unbounded delay a timeout is not a reliable failure detector, because a request may time out due to a network problem even if no node has crashed. For this reason, 2PC continues to be used, despite the known problem with coordinator failure.

分布式事务实践

Distributed Transactions in Practice

分布式事务,尤其是那些通过两阶段提交实现的事务,声誉褒贬不一。一方面,它们被视为提供了否则难以实现的重要安全保障;另一方面,它们因造成运维问题、扼杀性能以及承诺超出其能力范围而受到批评 [ 81, 82, 83, 84 ]。由于分布式事务带来的运维问题,许多云服务选择不实现分布式事务 [ 85, 86 ]。

Distributed transactions, especially those implemented with two-phase commit, have a mixed reputation. On the one hand, they are seen as providing an important safety guarantee that would be hard to achieve otherwise; on the other hand, they are criticized for causing operational problems, killing performance, and promising more than they can deliver [81, 82, 83, 84]. Many cloud services choose not to implement distributed transactions due to the operational problems they engender [85, 86].

分布式事务的某些实现会带来严重的性能损失。例如,据报道,MySQL 中的分布式事务比单节点事务慢 10 倍以上 [ 87 ],因此当人们建议不要使用它们时,这并不奇怪。两阶段提交固有的大部分性能成本,是由于崩溃恢复所需的额外磁盘强制写入(fsync)[ 88 ],以及额外的网络往返造成的。

Some implementations of distributed transactions carry a heavy performance penalty—for example, distributed transactions in MySQL are reported to be over 10 times slower than single-node transactions [87], so it is not surprising when people advise against using them. Much of the performance cost inherent in two-phase commit is due to the additional disk forcing (fsync) that is required for crash recovery [88], and the additional network round-trips.

然而,我们不应该完全否定分布式事务,而应该更详细地研究它们,因为我们可以从中吸取重要的教训。首先,我们应该准确理解“分布式事务”的含义。两种截然不同类型的分布式事务经常被混淆:

However, rather than dismissing distributed transactions outright, we should examine them in some more detail, because there are important lessons to be learned from them. To begin, we should be precise about what we mean by “distributed transactions.” Two quite different types of distributed transactions are often conflated:

数据库内部分布式事务
Database-internal distributed transactions

一些分布式数据库(即在其标准配置中使用复制和分区的数据库)支持该数据库各节点之间的内部事务。例如,VoltDB 和 MySQL Cluster 的 NDB 存储引擎就有这样的内部事务支持。在这种情况下,参与事务的所有节点都运行相同的数据库软件。

Some distributed databases (i.e., databases that use replication and partitioning in their standard configuration) support internal transactions among the nodes of that database. For example, VoltDB and MySQL Cluster’s NDB storage engine have such internal transaction support. In this case, all the nodes participating in the transaction are running the same database software.

异构分布式事务
Heterogeneous distributed transactions

异构事务中,参与者是两种或多种不同的技术:例如,来自不同供应商的两个数据库,甚至是消息代理等非数据库系统。跨这些系统的分布式事务必须确保原子提交,即使这些系统在底层可能完全不同。

In a heterogeneous transaction, the participants are two or more different technologies: for example, two databases from different vendors, or even non-database systems such as message brokers. A distributed transaction across these systems must ensure atomic commit, even though the systems may be entirely different under the hood.

数据库内部事务不必与任何其他系统兼容,因此它们可以使用任何协议并应用特定于该特定技术的优化。因此,数据库内部的分布式事务通常可以很好地工作。另一方面,跨越异构技术的交易更具挑战性。

Database-internal transactions do not have to be compatible with any other system, so they can use any protocol and apply optimizations specific to that particular technology. For that reason, database-internal distributed transactions can often work quite well. On the other hand, transactions spanning heterogeneous technologies are a lot more challenging.

恰好一次消息处理

Exactly-once message processing

异构分布式事务允许以强大的方式集成不同的系统。例如,当且仅当用于处理消息的数据库事务已成功提交时,来自消息队列的消息才可以被确认为已处理。这是通过原子地提交消息确认和数据库在单个事务中写入来实现的。借助分布式事务支持,即使消息代理和数据库是运行在不同机器上的两种不相关的技术,这也是可能的。

Heterogeneous distributed transactions allow diverse systems to be integrated in powerful ways. For example, a message from a message queue can be acknowledged as processed if and only if the database transaction for processing the message was successfully committed. This is implemented by atomically committing the message acknowledgment and the database writes in a single transaction. With distributed transaction support, this is possible, even if the message broker and the database are two unrelated technologies running on different machines.

如果消息传递或数据库事务失败,两者都会中止,因此消息代理可以稍后安全地重新传递消息。因此,通过原子地提交消息及其处理的副作用,我们可以确保消息被有效地处理一次,即使在成功之前需要重试几次。中止会丢弃部分完成的事务的任何副作用。

If either the message delivery or the database transaction fails, both are aborted, and so the message broker may safely redeliver the message later. Thus, by atomically committing the message and the side effects of its processing, we can ensure that the message is effectively processed exactly once, even if it required a few retries before it succeeded. The abort discards any side effects of the partially completed transaction.
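A sketch of this pattern, with hypothetical `Queue` and `Database` classes: the message acknowledgment and the database writes take effect together, or not at all, so a failed attempt leaves the message queued for redelivery. (Here the two steps simply run back-to-back after the handler succeeds; a real system would wrap them in a distributed, e.g. XA, transaction.)

```python
# Exactly-once effect from at-least-once delivery: ack the message
# and apply its writes together, or leave both undone.

class Queue:
    def __init__(self, messages):
        self.messages = list(messages)

    def peek(self):
        return self.messages[0] if self.messages else None

    def ack(self, msg):
        self.messages.remove(msg)

class Database:
    def __init__(self):
        self.rows = {}

    def apply(self, writes):
        self.rows.update(writes)

def process_message(queue, db, handler):
    """Process the head message so that the ack and the database writes
    happen together; if the handler fails, neither takes effect."""
    msg = queue.peek()
    if msg is None:
        return False
    try:
        writes = handler(msg)  # compute the effects without applying them
    except Exception:
        return False           # aborted: message stays queued for redelivery
    db.apply(writes)           # in a real system, these two steps would be
    queue.ack(msg)             # committed as one distributed transaction
    return True
```

If the handler fails on the first delivery, nothing is written and nothing is acknowledged, so the broker safely redelivers; a later successful attempt applies the writes exactly once.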

然而,只有受事务影响的所有系统都能够使用相同的原子提交协议,这样的分布式事务才有可能。例如,假设处理消息的副作用是发送电子邮件,并且电子邮件服务器不支持两阶段提交:如果消息处理失败并重试,则可能会发生电子邮件发送两次或更多次的情况。但是,如果在事务中止时回滚处理消息的所有副作用,则可以安全地重试处理步骤,就好像什么也没发生一样。

Such a distributed transaction is only possible if all systems affected by the transaction are able to use the same atomic commit protocol, however. For example, say a side effect of processing a message is to send an email, and the email server does not support two-phase commit: it could happen that the email is sent two or more times if message processing fails and is retried. But if all side effects of processing a message are rolled back on transaction abort, then the processing step can safely be retried as if nothing had happened.

我们将在第 11 章回到恰好一次消息处理的主题。下面我们首先看一下支持这种异构分布式事务的原子提交协议。

We will return to the topic of exactly-once message processing in Chapter 11. Let’s look first at the atomic commit protocol that allows such heterogeneous distributed transactions.

XA交易

XA transactions

X/Open XA (扩展架构的缩写)是跨异构技术实现两阶段提交的标准 [ 76 , 77 ]。它于 1991 年推出,并已得到广泛实施:XA 受到许多传统关系数据库(包括 PostgreSQL、MySQL、DB2、SQL Server 和 Oracle)和消息代理(包括 ActiveMQ、HornetQ、MSMQ 和 IBM MQ)的支持。

X/Open XA (short for eXtended Architecture) is a standard for implementing two-phase commit across heterogeneous technologies [76, 77]. It was introduced in 1991 and has been widely implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL, DB2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ).

XA 不是一个网络协议,它只是一个用于与事务协调器交互的 C API。其他语言中也存在此 API 的绑定;例如,在 Java EE 应用程序领域,XA 事务是使用 Java 事务 API(JTA)实现的,而 JTA 又受到许多使用 Java 数据库连接(JDBC)的数据库驱动程序以及使用 Java 消息服务(JMS)API 的消息代理驱动程序的支持。

XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator. Bindings for this API exist in other languages; for example, in the world of Java EE applications, XA transactions are implemented using the Java Transaction API (JTA), which in turn is supported by many drivers for databases using Java Database Connectivity (JDBC) and drivers for message brokers using the Java Message Service (JMS) APIs.

XA 假定您的应用程序使用网络驱动程序或客户端库与参与者数据库或消息传递服务进行通信。如果驱动程序支持 XA,这意味着它会调用 XA API 来确定某个操作是否应该成为分布式事务的一部分,如果是,它会将必要的信息发送到数据库服务器。驱动程序还公开回调,协调器可以通过这些回调要求参与者准备、提交或中止。

XA assumes that your application uses a network driver or client library to communicate with the participant databases or messaging services. If the driver supports XA, that means it calls the XA API to find out whether an operation should be part of a distributed transaction—and if so, it sends the necessary information to the database server. The driver also exposes callbacks through which the coordinator can ask the participant to prepare, commit, or abort.

事务协调器实现XA API。该标准没有指定它应该如何实现,但在实践中,协调器通常只是一个库,它加载到与发出事务的应用程序相同的进程中(而不是单独的服务)。它跟踪事务中的参与者,在要求参与者准备后收集参与者的响应(通过驱动程序的回调),并使用本地磁盘上的日志来跟踪每个事务的提交/中止决策。

The transaction coordinator implements the XA API. The standard does not specify how it should be implemented, but in practice the coordinator is often simply a library that is loaded into the same process as the application issuing the transaction (not a separate service). It keeps track of the participants in a transaction, collects participants’ responses after asking them to prepare (via a callback into the driver), and uses a log on the local disk to keep track of the commit/abort decision for each transaction.

如果应用程序进程崩溃,或者运行应用程序的机器崩溃,协调器也会随之崩溃。任何已准备好但未提交交易的参与者都会陷入怀疑。由于协调器的日志位于应用程序服务器的本地磁盘上,因此必须重新启动该服务器,并且协调器库必须读取日志以恢复每个事务的提交/中止结果。只有这样,协调器才能使用数据库驱动程序的 XA 回调来要求参与者根据需要提交或中止。数据库服务器无法直接联系协调器,因为所有通信都必须通过其客户端库进行。

If the application process crashes, or the machine on which the application is running dies, the coordinator goes with it. Any participants with prepared but uncommitted transactions are then stuck in doubt. Since the coordinator’s log is on the application server’s local disk, that server must be restarted, and the coordinator library must read the log to recover the commit/abort outcome of each transaction. Only then can the coordinator use the database driver’s XA callbacks to ask participants to commit or abort, as appropriate. The database server cannot contact the coordinator directly, since all communication must go via its client library.

怀疑时持有锁

Holding locks while in doubt

为什么我们如此关心陷入疑问的交易?难道系统的其余部分就不能继续工作,而忽略最终将被清理的可疑事务吗?

Why do we care so much about a transaction being stuck in doubt? Can’t the rest of the system just get on with its work, and ignore the in-doubt transaction that will be cleaned up eventually?

问题在于锁定。正如“已提交读”中所讨论的,数据库事务通常对其修改的任何行采取行级排他锁,以防止脏写。此外,如果您想要可序列化的隔离,使用两阶段锁定的数据库还必须对事务读取的任何行采取共享锁(请参阅“两阶段锁定(2PL)”)。

The problem is with locking. As discussed in “Read Committed”, database transactions usually take a row-level exclusive lock on any rows they modify, to prevent dirty writes. In addition, if you want serializable isolation, a database using two-phase locking would also have to take a shared lock on any rows read by the transaction (see “Two-Phase Locking (2PL)”).

在事务提交或中止之前,数据库无法释放这些锁(如图9-9中的阴影区域所示)。因此,当使用两阶段提交时,事务必须在有疑问的整个时间内保持锁定。如果协调器崩溃并需要 20 分钟才能再次启动,则这些锁将保留 20 分钟。如果协调器的日志由于某种原因完全丢失,这些锁将永远保留,或者至少直到管理员手动解决该情况为止。

The database cannot release those locks until the transaction commits or aborts (illustrated as a shaded area in Figure 9-9). Therefore, when using two-phase commit, a transaction must hold onto the locks throughout the time it is in doubt. If the coordinator has crashed and takes 20 minutes to start up again, those locks will be held for 20 minutes. If the coordinator’s log is entirely lost for some reason, those locks will be held forever—or at least until the situation is manually resolved by an administrator.

当这些锁被持有时,没有其他事务可以修改这些行。根据数据库的不同,其他事务甚至可能被阻止读取这些行。因此,其他事务不能简单地继续其业务——如果它们想要访问相同的数据,它们将被阻止。这可能会导致应用程序的大部分内容不可用,直到有疑问的事务得到解决为止。

While those locks are held, no other transaction can modify those rows. Depending on the database, other transactions may even be blocked from reading those rows. Thus, other transactions cannot simply continue with their business—if they want to access that same data, they will be blocked. This can cause large parts of your application to become unavailable until the in-doubt transaction is resolved.

从协调器故障中恢复

Recovering from coordinator failure

理论上,如果协调器崩溃并重新启动,它应该能从日志中干净地恢复其状态,并解决所有存疑事务。然而在实践中,孤立的存疑事务确实会发生[ 89, 90 ]——也就是说,协调器因某种原因无法确定其结果的事务(例如,因为事务日志由于软件错误而丢失或损坏)。这些事务无法自动解决,因此它们会永远留在数据库中,持有锁并阻塞其他事务。

In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the log and resolve any in-doubt transactions. However, in practice, orphaned in-doubt transactions do occur [89, 90]—that is, transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because the transaction log has been lost or corrupted due to a software bug). These transactions cannot be resolved automatically, so they sit forever in the database, holding locks and blocking other transactions.

即使重新启动数据库服务器也无法解决此问题,因为 2PC 的正确实现必须在重新启动时保留不确定事务的锁(否则可能会违反原子性保证)。这是一个棘手的情况。

Even rebooting your database servers will not fix this problem, since a correct implementation of 2PC must preserve the locks of an in-doubt transaction even across restarts (otherwise it would risk violating the atomicity guarantee). It’s a sticky situation.

唯一的出路是由管理员手动决定提交还是回滚这些事务。管理员必须检查每个存疑事务的参与者,确定是否有参与者已经提交或中止,然后将相同的结果应用于其他参与者。解决这个问题可能需要大量的人工操作,而且很可能要在严重的生产中断期间、在高度紧张和时间压力下完成(否则,协调器为什么会处于如此糟糕的状态呢?)。

The only way out is for an administrator to manually decide whether to commit or roll back the transactions. The administrator must examine the participants of each in-doubt transaction, determine whether any participant has committed or aborted already, and then apply the same outcome to the other participants. Resolving the problem potentially requires a lot of manual effort, and most likely needs to be done under high stress and time pressure during a serious production outage (otherwise, why would the coordinator be in such a bad state?).

许多 XA 实现都有一个称为启发式决策(heuristic decisions)的紧急逃生口:允许参与者在没有协调器明确决定的情况下,单方面决定中止或提交一个存疑事务[ 76, 77, 91 ]。需要明确的是,这里的启发式是可能破坏原子性的委婉说法,因为它违背了两阶段提交中的承诺体系。因此,启发式决策仅用于摆脱灾难性情况,而不适合常规使用。

Many XA implementations have an emergency escape hatch called heuristic decisions: allowing a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive decision from the coordinator [76, 77, 91]. To be clear, heuristic here is a euphemism for probably breaking atomicity, since it violates the system of promises in two-phase commit. Thus, heuristic decisions are intended only for getting out of catastrophic situations, and not for regular use.

分布式事务的局限性

Limitations of distributed transactions

XA 事务解决了保持多个参与者数据系统彼此一致的真正且重要的问题,但正如我们所看到的,它们也引入了主要的操作问题。特别是,关键的认识是事务协调器本身就是一种数据库(其中存储事务结果),因此需要像任何其他重要数据库一样谨慎对待它:

XA transactions solve the real and important problem of keeping several participant data systems consistent with each other, but as we have seen, they also introduce major operational problems. In particular, the key realization is that the transaction coordinator is itself a kind of database (in which transaction outcomes are stored), and so it needs to be approached with the same care as any other important database:

  • 如果协调器没有复制而是只在一台机器上运行,那么它就是整个系统的单点故障(因为它的故障会导致其他应用程序服务器阻塞可疑事务所持有的锁)。令人惊讶的是,许多协调器实现默认情况下可用性不高,或者仅具有基本的复制支持。

  • If the coordinator is not replicated but runs only on a single machine, it is a single point of failure for the entire system (since its failure causes other application servers to block on locks held by in-doubt transactions). Surprisingly, many coordinator implementations are not highly available by default, or have only rudimentary replication support.

  • 许多服务器端应用程序都是以无状态模型开发的(HTTP 所青睐的),所有持久状态都存储在数据库中,这样的优点是可以随意添加和删除应用程序服务器。但是,当协调器是应用程序服务器的一部分时,它会改变部署的性质。突然间,协调器的日志成为持久系统状态的关键部分,与数据库本身一样重要,因为需要协调器日志才能在崩溃后恢复可疑事务。此类应用程序服务器不再是无状态的。

  • Many server-side applications are developed in a stateless model (as favored by HTTP), with all persistent state stored in a database, which has the advantage that application servers can be added and removed at will. However, when the coordinator is part of the application server, it changes the nature of the deployment. Suddenly, the coordinator’s logs become a crucial part of the durable system state—as important as the databases themselves, since the coordinator logs are required in order to recover in-doubt transactions after a crash. Such application servers are no longer stateless.

  • 由于 XA 需要与广泛的数据系统兼容,它必然是一个最小公分母。例如,它无法检测不同系统之间的死锁(因为那需要一个标准化协议,让各系统交换每个事务正在等待的锁的信息),并且它不能与 SSI 一起使用(请参阅“可串行化快照隔离(SSI)”),因为那需要一个跨不同系统识别冲突的协议。

  • Since XA needs to be compatible with a wide range of data systems, it is necessarily a lowest common denominator. For example, it cannot detect deadlocks across different systems (since that would require a standardized protocol for systems to exchange information on the locks that each transaction is waiting for), and it does not work with SSI (see “Serializable Snapshot Isolation (SSI)”), since that would require a protocol for identifying conflicts across different systems.

  • 对于数据库内部的分布式事务(而不是 XA),限制没有那么大——例如,分布式版本的 SSI 是可能的。然而仍然存在一个问题:2PC 要成功提交事务,所有参与者都必须做出响应。因此,如果系统的任何部分出了故障,事务也会失败。分布式事务因此有放大故障的倾向,这与我们构建容错系统的目标背道而驰。

  • For database-internal distributed transactions (not XA), the limitations are not so great—for example, a distributed version of SSI is possible. However, there remains the problem that for 2PC to successfully commit a transaction, all participants must respond. Consequently, if any part of the system is broken, the transaction also fails. Distributed transactions thus have a tendency of amplifying failures, which runs counter to our goal of building fault-tolerant systems.

这些事实是否意味着我们应该放弃让多个系统保持相互一致的所有希望?并非如此——有一些替代方法可以让我们实现同样的目标,而无需承受异构分布式事务的痛苦。我们将在第 11 章和第 12 章中再次讨论这些内容。但首先,我们应该结束共识这个话题。

Do these facts mean we should give up all hope of keeping several systems consistent with each other? Not quite—there are alternative methods that allow us to achieve the same thing without the pain of heterogeneous distributed transactions. We will return to these in Chapters 11 and 12. But first, we should wrap up the topic of consensus.

容错共识

Fault-Tolerant Consensus

非正式地,共识意味着让多个节点就某件事达成一致。例如,如果几个人同时尝试预订飞机上的最后一个座位,或剧院里的同一个座位,或尝试用相同的用户名注册账户,那么可以使用共识算法来确定这些相互不兼容的操作中哪一个应该胜出。

Informally, consensus means getting several nodes to agree on something. For example, if several people concurrently try to book the last seat on an airplane, or the same seat in a theater, or try to register an account with the same username, then a consensus algorithm could be used to determine which one of these mutually incompatible operations should be the winner.

共识问题通常形式化如下:一个或多个节点可以提出值,并且共识算法决定这些值之一。在座位预订示例中,当多个客户同时尝试购买最后一个座位时,处理客户请求的每个节点都可以提出其正在服务的客户的 ID,并且决策表明这些客户中的哪一个获得了座位。

The consensus problem is normally formalized as follows: one or more nodes may propose values, and the consensus algorithm decides on one of those values. In the seat-booking example, when several customers are concurrently trying to buy the last seat, each node handling a customer request may propose the ID of the customer it is serving, and the decision indicates which one of those customers got the seat.

在这种形式化表述中,共识算法必须满足以下属性[ 25 ]:

In this formalism, a consensus algorithm must satisfy the following properties [25]:

统一协议
Uniform agreement

没有两个节点做出不同的决定。

No two nodes decide differently.

完整性
Integrity

没有节点会做出两次决定。

No node decides twice.

有效性
Validity

如果一个节点决定了值v,那么v是由某个节点提议的。

If a node decides value v, then v was proposed by some node.

终止
Termination

每个不崩溃的节点最终都会决定一些值。

Every node that does not crash eventually decides some value.

统一协议和完整性属性定义了共识的核心思想:每个节点都决定相同的结果,而且一旦做出决定就不能改变主意。有效性属性的存在主要是为了排除平凡的解:例如,可以有一个无论提议了什么都始终决定 null 的算法;该算法满足一致性和完整性属性,但不满足有效性属性。

The uniform agreement and integrity properties define the core idea of consensus: everyone decides on the same outcome, and once you have decided, you cannot change your mind. The validity property exists mostly to rule out trivial solutions: for example, you could have an algorithm that always decides null, no matter what was proposed; this algorithm would satisfy the agreement and integrity properties, but not the validity property.
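As an illustration, the safety properties can be written as executable checks over a trace of one execution (the trace format and function name here are invented for the example):

```python
# Hedged sketch: the safety properties of consensus (uniform agreement,
# integrity, validity) as checks over a recorded execution. `proposed`
# maps each node to the value it proposed; `decisions` maps each node to
# the list of values it decided.

def check_safety(proposed, decisions):
    decided_values = [v for ds in decisions.values() for v in ds]

    # Integrity: no node decides twice.
    integrity = all(len(ds) <= 1 for ds in decisions.values())

    # Uniform agreement: no two nodes decide differently.
    agreement = len(set(decided_values)) <= 1

    # Validity: every decided value was proposed by some node.
    validity = all(v in proposed.values() for v in decided_values)

    return integrity and agreement and validity

# Three nodes race to book the last seat; all decide customer "alice".
proposed = {"n1": "alice", "n2": "bob", "n3": "carol"}
ok = check_safety(proposed, {"n1": ["alice"], "n2": ["alice"], "n3": ["alice"]})

# The trivial algorithm that always decides None violates validity.
bad = check_safety(proposed, {"n1": [None], "n2": [None], "n3": [None]})
```

Termination is deliberately absent here: as a liveness property, it cannot be checked against a finite trace of what has happened so far.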

如果您不关心容错性,那么满足前三个属性很容易:您可以将一个节点硬编码为“独裁者”,并让该节点做出所有决策。然而,如果该节点发生故障,那么系统将无法再做出任何决策。事实上,这就是我们在两阶段提交的情况下看到的:如果协调者失败,有疑问的参与者无法决定是提交还是中止。

If you don’t care about fault tolerance, then satisfying the first three properties is easy: you can just hardcode one node to be the “dictator,” and let that node make all of the decisions. However, if that one node fails, then the system can no longer make any decisions. This is, in fact, what we saw in the case of two-phase commit: if the coordinator fails, in-doubt participants cannot decide whether to commit or abort.

终止属性形式化了容错的概念。它本质上是说,共识算法不能永远无所事事——换句话说,它必须取得进展。即使某些节点失败,其他节点仍然必须做出决定。(终止是一种活性属性,而其他三个是安全属性——请参阅“安全性和活性”。)

The termination property formalizes the idea of fault tolerance. It essentially says that a consensus algorithm cannot simply sit around and do nothing forever—in other words, it must make progress. Even if some nodes fail, the other nodes must still reach a decision. (Termination is a liveness property, whereas the other three are safety properties—see “Safety and liveness”.)

共识的系统模型假设,当一个节点“崩溃”时,它会突然消失并且永远不会回来。(想象一下发生了地震,而不是软件崩溃,包含您的节点的数据中心被山体滑坡摧毁了。您必须假设您的节点被埋在 30 英尺的泥土下,并且永远不会恢复在线状态。)在这个系统模型中,任何必须等待节点恢复的算法都无法满足终止属性。特别是2PC不满足终止的要求。

The system model of consensus assumes that when a node “crashes,” it suddenly disappears and never comes back. (Instead of a software crash, imagine that there is an earthquake, and the datacenter containing your node is destroyed by a landslide. You must assume that your node is buried under 30 feet of mud and is never going to come back online.) In this system model, any algorithm that has to wait for a node to recover is not going to be able to satisfy the termination property. In particular, 2PC does not meet the requirements for termination.

当然,如果所有节点都崩溃,没有任何节点在运行,那么任何算法都不可能做出任何决定。算法能够容忍的故障数量是有限的:事实上,可以证明任何共识算法都需要至少大多数节点正常运行才能确保终止[ 67 ]。这一多数可以安全地构成法定人数(请参阅“读写的法定人数”)。

Of course, if all nodes crash and none of them are running, then it is not possible for any algorithm to decide anything. There is a limit to the number of failures that an algorithm can tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of nodes to be functioning correctly in order to assure termination [67]. That majority can safely form a quorum (see “Quorums for reading and writing”).

因此,终止属性取决于少于一半的节点崩溃或无法访问的假设。然而,大多数共识的实现都确保始终满足安全属性——协议、完整性和有效性——即使大多数节点失败或存在严重的网络问题[92 ]。因此,大规模的中断可以阻止系统处理请求,但它不能通过导致其做出无效决策来破坏共识系统。

Thus, the termination property is subject to the assumption that fewer than half of the nodes are crashed or unreachable. However, most implementations of consensus ensure that the safety properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or there is a severe network problem [92]. Thus, a large-scale outage can stop the system from being able to process requests, but it cannot corrupt the consensus system by causing it to make invalid decisions.

大多数共识算法都假设不存在拜占庭错误,如 “拜占庭错误”中所述。也就是说,如果节点没有正确遵循协议(例如,如果它向不同节点发送矛盾的消息),则可能会破坏协议的安全属性。只要少于三分之一的节点存在拜占庭故障,就有可能针对拜占庭故障达成稳健的共识 [ 25 , 93 ],但我们没有足够的空间在本书中详细讨论这些算法。

Most consensus algorithms assume that there are no Byzantine faults, as discussed in “Byzantine Faults”. That is, if a node does not correctly follow the protocol (for example, if it sends contradictory messages to different nodes), it may break the safety properties of the protocol. It is possible to make consensus robust against Byzantine faults as long as fewer than one-third of the nodes are Byzantine-faulty [25, 93], but we don’t have space to discuss those algorithms in detail in this book.

共识算法和全序广播

Consensus algorithms and total order broadcast

最著名的容错共识算法是视图戳复制(Viewstamped Replication,VSR)[ 94, 95 ]、Paxos [ 96, 97, 98, 99 ]、Raft [ 22, 100, 101 ] 和 Zab [ 15, 21, 102 ]。这些算法之间有不少相似之处,但它们并不相同[ 103 ]。在本书中,我们不会详细介绍这些不同的算法:了解它们共同的一些高层思想就足够了,除非您要自己实现一个共识系统(这可能并不可取——这很难[ 98, 104 ])。

The best-known fault-tolerant consensus algorithms are Viewstamped Replication (VSR) [94, 95], Paxos [96, 97, 98, 99], Raft [22, 100, 101], and Zab [15, 21, 102]. There are quite a few similarities between these algorithms, but they are not the same [103]. In this book we won’t go into full details of the different algorithms: it’s sufficient to be aware of some of the high-level ideas that they have in common, unless you’re implementing a consensus system yourself (which is probably not advisable—it’s hard [98, 104]).

这些算法中的大多数实际上并不直接使用这里描述的形式化模型(提出并决定单个值,同时满足协议、完整性、有效性和终止属性)。相反,它们决定的是一系列值,这使它们成为全序广播算法,如本章前面所讨论的(请参阅“全序广播”)。

Most of these algorithms actually don’t directly use the formal model described here (proposing and deciding on a single value, while satisfying the agreement, integrity, validity, and termination properties). Instead, they decide on a sequence of values, which makes them total order broadcast algorithms, as discussed previously in this chapter (see “Total Order Broadcast”).

请记住,全序广播要求消息恰好一次、以相同的顺序传递到所有节点。仔细想想,这相当于执行多轮共识:在每一轮中,节点提出它们接下来想要发送的消息,然后决定在全序中传递的下一条消息[ 67 ]。

Remember that total order broadcast requires messages to be delivered exactly once, in the same order, to all nodes. If you think about it, this is equivalent to performing several rounds of consensus: in each round, nodes propose the message that they want to send next, and then decide on the next message to be delivered in the total order [67].

因此,全序广播相当于重复多轮共识(每个共识决策对应一次消息传递):

So, total order broadcast is equivalent to repeated rounds of consensus (each consensus decision corresponding to one message delivery):

  • 由于共识的协议属性,所有节点决定以相同的顺序传递相同的消息。

  • Due to the agreement property of consensus, all nodes decide to deliver the same messages in the same order.

  • 由于完整性属性,消息不会重复。

  • Due to the integrity property, messages are not duplicated.

  • 由于有效性属性,消息不会被损坏,也不会凭空捏造。

  • Due to the validity property, messages are not corrupted and not fabricated out of thin air.

  • 由于终止属性,消息不会丢失。

  • Due to the termination property, messages are not lost.

Viewstamped Replication、Raft 和 Zab 直接实现全序广播,因为这比重复进行一次一个值的共识更有效。就 Paxos 而言,这种优化称为 Multi-Paxos。

Viewstamped Replication, Raft, and Zab implement total order broadcast directly, because that is more efficient than doing repeated rounds of one-value-at-a-time consensus. In the case of Paxos, this optimization is known as Multi-Paxos.
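The equivalence between repeated consensus and total order broadcast can be sketched in a few lines. The `decide` function below stands in for a full consensus round and picks a winner deterministically, which is of course not fault-tolerant; it only illustrates the slot-per-round structure:

```python
# Sketch: total order broadcast built from repeated rounds of
# single-value consensus, one round per log slot. `decide` is a stand-in
# for a real consensus round (Paxos, Raft, ...) and is purely
# illustrative: it picks the message of the lowest-numbered proposer.

def decide(proposals):
    # One consensus round: all nodes agree on a single proposed value.
    return proposals[min(proposals)]

def total_order_broadcast(rounds):
    log = []
    for proposals in rounds:  # each round: node -> message it proposes
        log.append(decide(proposals))  # slot i = decision of round i
    return log

# Every replica that applies `log` in order ends up in the same state.
log = total_order_broadcast([
    {"n1": "write x=1", "n2": "write y=2"},
    {"n2": "write y=2"},
])
```

Running a separate consensus round per slot like this is what Multi-Paxos and the log replication in Raft/Zab optimize away by keeping a stable leader across slots.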

单领导者复制和共识

Single-leader replication and consensus

在第 5 章中,我们讨论了单领导者复制(请参阅“领导者与追随者”),它把所有写入都交给领导者,并以相同的顺序将它们应用到追随者,从而使副本保持最新。这本质上不就是全序广播吗?为什么我们在第 5 章中不必担心共识问题呢?

In Chapter 5 we discussed single-leader replication (see “Leaders and Followers”), which takes all the writes to the leader and applies them to the followers in the same order, thus keeping replicas up to date. Isn’t this essentially total order broadcast? How come we didn’t have to worry about consensus in Chapter 5?

答案取决于领导者是如何选出的。如果领导者是由运维团队的人手动选择和配置的,那么您实际上拥有的是一种“独裁式”的“共识算法”:只允许一个节点接受写入(即决定复制日志中写入的顺序),如果该节点发生故障,系统将无法写入,直到操作员手动将另一个节点配置为领导者。这样的系统在实践中可以很好地工作,但它不满足共识的终止属性,因为它需要人为干预才能取得进展。

The answer comes down to how the leader is chosen. If the leader is manually chosen and configured by the humans in your operations team, you essentially have a “consensus algorithm” of the dictatorial variety: only one node is allowed to accept writes (i.e., make decisions about the order of writes in the replication log), and if that node goes down, the system becomes unavailable for writes until the operators manually configure a different node to be the leader. Such a system can work well in practice, but it does not satisfy the termination property of consensus because it requires human intervention in order to make progress.

一些数据库执行自动领导者选举和故障转移,如果旧领导者失败,则将追随者提升为新领导者(请参阅“处理节点中断”)。这使我们更接近容错的全序广播,从而更接近解决共识。

Some databases perform automatic leader election and failover, promoting a follower to be the new leader if the old leader fails (see “Handling Node Outages”). This brings us closer to fault-tolerant total order broadcast, and thus to solving consensus.

然而,有一个问题。我们之前讨论过脑裂问题,并说过所有节点需要就领导者是谁达成一致,否则两个不同的节点可能都认为自己是领导者,从而使数据库陷入不一致的状态。因此,我们需要达成共识才能选举领导人。但如果这里描述的共识算法实际上是全序广播算法,而全序广播就像单领导者复制,而单领导者复制需要领导者,那么……

However, there is a problem. We previously discussed the problem of split brain, and said that all nodes need to agree who the leader is—otherwise two different nodes could each believe themselves to be the leader, and consequently get the database into an inconsistent state. Thus, we need consensus in order to elect a leader. But if the consensus algorithms described here are actually total order broadcast algorithms, and total order broadcast is like single-leader replication, and single-leader replication requires a leader, then…

看来,要选出一个领导者,首先需要一个领导者。要解决共识,首先要解决共识。我们如何摆脱这个难题?

It seems that in order to elect a leader, we first need a leader. In order to solve consensus, we must first solve consensus. How do we break out of this conundrum?

纪元编号和法定人数

Epoch numbering and quorums

到目前为止讨论的所有共识协议在内部都以某种形式使用了领导者,但它们并不保证领导者是唯一的。相反,它们做出较弱的保证:协议定义一个纪元号(在 Paxos 中称为选票号,在视图戳复制中称为视图号,在 Raft 中称为任期号),并保证在每个纪元内,领导者是唯一的。

All of the consensus protocols discussed so far internally use a leader in some form or another, but they don’t guarantee that the leader is unique. Instead, they can make a weaker guarantee: the protocols define an epoch number (called the ballot number in Paxos, view number in Viewstamped Replication, and term number in Raft) and guarantee that within each epoch, the leader is unique.

每当当前领导者被认为已死亡时,节点之间就会开始投票以选举新的领导者。这次选举被赋予一个递增的纪元号,因此纪元号是完全有序且单调递增的。如果两个不同 epoch 中的两个不同领导者之间发生冲突(可能是因为前一个领导者实际上并没有死),那么 epoch 编号较高的领导者占上风。

Every time the current leader is thought to be dead, a vote is started among the nodes to elect a new leader. This election is given an incremented epoch number, and thus epoch numbers are totally ordered and monotonically increasing. If there is a conflict between two different leaders in two different epochs (perhaps because the previous leader actually wasn’t dead after all), then the leader with the higher epoch number prevails.

在领导者被允许做出任何决定之前,它必须首先检查是否存在其他具有更高纪元号的领导者可能会做出相互冲突的决定。领导者如何知道它没有被另一个节点驱逐?回想一下“真理是由多数人决定的”:一个节点不一定相信自己的判断——仅仅因为一个节点认为自己是领导者,并不一定意味着其他节点接受它作为领导者。

Before a leader is allowed to decide anything, it must first check that there isn’t some other leader with a higher epoch number which might take a conflicting decision. How does a leader know that it hasn’t been ousted by another node? Recall “The Truth Is Defined by the Majority”: a node cannot necessarily trust its own judgment—just because a node thinks that it is the leader, that does not necessarily mean the other nodes accept it as their leader.

相反,它必须从法定人数的节点中收集投票(请参阅“读写法定人数”)。对于领导者想要做出的每个决定,它必须将提议值发送给其他节点,并等待法定数量的节点响应支持该提议。法定人数通常(但并非总是)由大多数节点组成[ 105 ]。仅当节点不知道任何其他具有更高纪元的领导者时,节点才会投票赞成提案。

Instead, it must collect votes from a quorum of nodes (see “Quorums for reading and writing”). For every decision that a leader wants to make, it must send the proposed value to the other nodes and wait for a quorum of nodes to respond in favor of the proposal. The quorum typically, but not always, consists of a majority of nodes [105]. A node votes in favor of a proposal only if it is not aware of any other leader with a higher epoch.

因此,我们有两轮投票:一次是选择领导者,第二次是对领导者的提案进行投票。关键的见解是这两次投票的法定人数必须重叠:如果对提案的投票成功,则至少有一个投票支持该提案的节点也必须参与最近的领导者选举[105 ]。因此,如果对提案的投票没有显示任何更高纪元数,则当前领导者可以得出结论,没有发生具有更高纪元数的领导者选举,因此可以确保它仍然拥有领导权。然后它可以安全地决定建议值。

Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal. The key insight is that the quorums for those two votes must overlap: if a vote on a proposal succeeds, at least one of the nodes that voted for it must have also participated in the most recent leader election [105]. Thus, if the vote on a proposal does not reveal any higher-numbered epoch, the current leader can conclude that no leader election with a higher epoch number has happened, and therefore be sure that it still holds the leadership. It can then safely decide the proposed value.
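A toy model makes the overlap argument concrete. The `Acceptor` class below is invented for illustration; real protocols such as Raft or Paxos carry far more state, but the "reject anything from a lower epoch" check is the same in spirit:

```python
# Illustrative sketch of the two voting rounds: each acceptor remembers
# the highest epoch it has voted for, and a proposal succeeds only if a
# majority quorum accepts it. Because any two majorities of the same
# node set overlap, a deposed leader (lower epoch) cannot win a quorum.

class Acceptor:
    def __init__(self):
        self.promised_epoch = 0

    def vote_leader(self, epoch):
        # Round 1: vote in the leader election for `epoch`.
        if epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

    def accept(self, epoch, value):
        # Round 2: accept a proposal only if no higher epoch is known.
        return epoch >= self.promised_epoch

nodes = [Acceptor() for _ in range(5)]
majority = len(nodes) // 2 + 1

# A leader for epoch 1 is elected by one majority...
assert sum(n.vote_leader(1) for n in nodes[:3]) >= majority

# ...then a new leader for epoch 2 is elected by an overlapping majority.
assert sum(n.vote_leader(2) for n in nodes[2:]) >= majority

# The old epoch-1 leader now proposes a value. The quorums overlap at
# nodes[2], which has promised epoch 2, so the old leader falls short.
old_votes = sum(n.accept(1, "v") for n in nodes[:3])
new_votes = sum(n.accept(2, "w") for n in nodes[2:])
```

The assertion that matters is that `old_votes` can never reach a majority once a higher-epoch election has completed, precisely because every majority contains at least one node from the newer election.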

这个投票过程表面上看起来类似于两阶段提交。最大的区别在于:在 2PC 中协调器不是选举出来的;而且容错共识算法只需要大多数节点的投票,而 2PC 要求每个参与者都投“是”票。此外,共识算法定义了一个恢复过程,使节点在选出新领导者后能够进入一致的状态,从而确保安全属性始终得到满足。这些差异是共识算法的正确性和容错性的关键。

This voting process looks superficially similar to two-phase commit. The biggest differences are that in 2PC the coordinator is not elected, and that fault-tolerant consensus algorithms only require votes from a majority of nodes, whereas 2PC requires a “yes” vote from every participant. Moreover, consensus algorithms define a recovery process by which nodes can get into a consistent state after a new leader is elected, ensuring that the safety properties are always met. These differences are key to the correctness and fault tolerance of a consensus algorithm.

共识的局限性

Limitations of consensus

共识算法对于分布式系统来说是一个巨大的突破:它们为其他一切都不确定的系统带来了具体的安全属性(一致性、完整性和有效性),并且它们仍然保持容错(只要大多数节点正在工作且可达,就能够取得进展)。它们提供全序广播,因此还能以容错的方式实现可线性化的原子操作(请参阅“使用全序广播实现可线性化存储”)。

Consensus algorithms are a huge breakthrough for distributed systems: they bring concrete safety properties (agreement, integrity, and validity) to systems where everything else is uncertain, and they nevertheless remain fault-tolerant (able to make progress as long as a majority of nodes are working and reachable). They provide total order broadcast, and therefore they can also implement linearizable atomic operations in a fault-tolerant way (see “Implementing linearizable storage using total order broadcast”).

然而,它们并没有到处使用,因为好处是有代价的。

Nevertheless, they are not used everywhere, because the benefits come at a cost.

节点在决定提案之前对其进行投票的过程是一种同步复制。正如“同步与异步复制”中所讨论的,数据库通常配置为使用异步复制。在此配置中,一些已提交的数据可能会在故障转移时丢失,但许多人为了获得更好的性能而选择接受这种风险。

The process by which nodes vote on proposals before they are decided is a kind of synchronous replication. As discussed in “Synchronous Versus Asynchronous Replication”, databases are often configured to use asynchronous replication. In this configuration, some committed data can potentially be lost on failover—but many people choose to accept this risk for the sake of better performance.

共识系统始终需要严格多数才能运行。这意味着您至少需要三个节点才能容忍一次故障(三个中剩下的两个构成多数),或者至少需要五个节点才能容忍两次故障(五个中剩下的三个构成多数)。如果网络故障将某些节点与其余节点隔断,则只有占多数的那部分网络可以继续取得进展,其余部分将被阻塞(另请参阅“线性一致性的代价”)。

Consensus systems always require a strict majority to operate. This means you need a minimum of three nodes in order to tolerate one failure (the remaining two out of three form a majority), or a minimum of five nodes to tolerate two failures (the remaining three out of five form a majority). If a network failure cuts off some nodes from the rest, only the majority portion of the network can make progress, and the rest is blocked (see also “The Cost of Linearizability”).
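The arithmetic in this paragraph can be stated as a two-line rule (a sketch of the sizing principle, not tied to any particular implementation):

```python
# Sizing rule for a majority-quorum consensus cluster: with 2f + 1 nodes,
# any f can fail and the remaining f + 1 still form a strict majority.

def min_cluster_size(f):
    """Smallest cluster that tolerates f simultaneously crashed nodes."""
    return 2 * f + 1

def is_majority(alive, total):
    return alive > total // 2

# Three nodes tolerate one failure; five tolerate two.
assert min_cluster_size(1) == 3 and is_majority(2, 3)
assert min_cluster_size(2) == 5 and is_majority(3, 5)
```

Note that adding a fourth node to a three-node cluster buys nothing: a majority of four is three, so the cluster still tolerates only one failure.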

大多数共识算法都假设有一组固定的节点参与投票,这意味着您不能只在集群中添加或删除节点。共识算法的动态成员资格扩展允许集群中的节点集随时间变化,但它们比静态成员资格算法更难理解。

Most consensus algorithms assume a fixed set of nodes that participate in voting, which means that you can’t just add or remove nodes in the cluster. Dynamic membership extensions to consensus algorithms allow the set of nodes in the cluster to change over time, but they are much less well understood than static membership algorithms.

共识系统通常依靠超时来检测故障节点。在网络延迟高度变化的环境中,尤其是地理分布式系统中,经常会发生节点错误地认为领导者由于瞬态网络问题而发生故障的情况。尽管此错误不会损害安全性,但频繁的领导者选举会导致糟糕的性能,因为系统最终可能会花费更多的时间选择领导者而不是做任何有用的工作。

Consensus systems generally rely on timeouts to detect failed nodes. In environments with highly variable network delays, especially geographically distributed systems, it often happens that a node falsely believes the leader to have failed due to a transient network issue. Although this error does not harm the safety properties, frequent leader elections result in terrible performance because the system can end up spending more time choosing a leader than doing any useful work.

有时,共识算法对网络问题特别敏感。例如,Raft 已被证明存在令人不快的边缘情况 [ 106 ]:如果整个网络工作正常,除了一个特定的网络链路始终不可靠,Raft 可能会陷入领导权在两个节点之间不断跳动的情况,或者当前的节点领导者不断被迫辞职,因此系统实际上永远不会取得进展。其他共识算法也存在类似的问题,设计对不可靠网络更鲁棒的算法仍然是一个开放的研究问题。

Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft has been shown to have unpleasant edge cases [106]: if the entire network is working correctly except for one particular network link that is consistently unreliable, Raft can get into situations where leadership continually bounces between two nodes, or the current leader is continually forced to resign, so the system effectively never makes progress. Other consensus algorithms have similar problems, and designing algorithms that are more robust to unreliable networks is still an open research problem.

会员和协调服务

Membership and Coordination Services

像 ZooKeeper 或 etcd 这样的项目通常被描述为“分布式键值存储”或“协调和配置服务”。此类服务的 API 看起来非常类似于数据库的 API:您可以读取和写入给定键的值,并迭代键。那么,如果它们基本上是数据库,为什么它们要花费所有精力来实现共识算法呢?它们与其他类型的数据库有何不同?

Projects like ZooKeeper or etcd are often described as “distributed key-value stores” or “coordination and configuration services.” The API of such a service looks pretty much like that of a database: you can read and write the value for a given key, and iterate over keys. So if they’re basically databases, why do they go to all the effort of implementing a consensus algorithm? What makes them different from any other kind of database?

为了理解这一点,简要探讨一下像 ZooKeeper 这样的服务是如何使用的会很有帮助。作为应用程序开发人员,您很少需要直接使用 ZooKeeper,因为它实际上不太适合作为通用数据库。您最终更有可能通过其他一些项目间接依赖它:例如,HBase、Hadoop YARN、OpenStack Nova 和 Kafka 都依赖于在后台运行的 ZooKeeper。这些项目从中得到了什么?

To understand this, it is helpful to briefly explore how a service like ZooKeeper is used. As an application developer, you will rarely need to use ZooKeeper directly, because it is actually not well suited as a general-purpose database. It is more likely that you will end up relying on it indirectly via some other project: for example, HBase, Hadoop YARN, OpenStack Nova, and Kafka all rely on ZooKeeper running in the background. What is it that these projects get from it?

ZooKeeper 和 etcd 旨在保存可以完全放入内存的少量数据(尽管为了持久性它们仍会写入磁盘),因此您不会希望把应用程序的所有数据都存储在这里。这些少量数据会使用容错的全序广播算法复制到所有节点上。如前所述,全序广播正是数据库复制所需要的:如果每条消息都代表对数据库的一次写入,那么以相同的顺序应用相同的写入就能使各副本彼此保持一致。

ZooKeeper and etcd are designed to hold small amounts of data that can fit entirely in memory (although they still write to disk for durability)—so you wouldn’t want to store all of your application’s data here. That small amount of data is replicated across all the nodes using a fault-tolerant total order broadcast algorithm. As discussed previously, total order broadcast is just what you need for database replication: if each message represents a write to the database, applying the same writes in the same order keeps replicas consistent with each other.

ZooKeeper 模仿了 Google 的 Chubby 锁服务 [ 14 , 98 ],不仅实现了全序广播(从而达成共识),而且还实现了一组有趣的其他功能,这些功能在构建分布式系统时特别有用:

ZooKeeper is modeled after Google’s Chubby lock service [14, 98], implementing not only total order broadcast (and hence consensus), but also an interesting set of other features that turn out to be particularly useful when building distributed systems:

线性化原子操作
Linearizable atomic operations

使用原子比较和设置操作,您可以实现锁定:如果多个节点同时尝试执行相同的操作,则只有其中一个会成功。即使节点发生故障或网络在任何时候中断,共识协议也保证操作是原子的和线性化的。分布式锁通常以租约的形式实现,租约有一个到期时间,以便在客户端失败时最终被释放(请参阅 “进程暂停”)。

Using an atomic compare-and-set operation, you can implement a lock: if several nodes concurrently try to perform the same operation, only one of them will succeed. The consensus protocol guarantees that the operation will be atomic and linearizable, even if a node fails or the network is interrupted at any point. A distributed lock is usually implemented as a lease, which has an expiry time so that it is eventually released in case the client fails (see “Process Pauses”).
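A rough sketch of a lease built on compare-and-set might look like the following. This is an in-memory toy in the spirit of common ZooKeeper/etcd lock recipes, not a client for either system, and it sidesteps real-world concerns such as clock skew and process pauses:

```python
# Hedged sketch of a lease (an expiring lock) implemented with an atomic
# compare-and-set on a key-value store. All names are illustrative.

class KVStore:
    def __init__(self):
        self.data = {}

    def compare_and_set(self, key, expected, new):
        # Atomic in this single-threaded toy; a real coordination service
        # serializes such operations through its consensus-replicated log.
        if self.data.get(key) == expected:
            self.data[key] = new
            return True
        return False

def acquire_lease(store, resource, owner, ttl, now):
    current = store.data.get(resource)  # (owner, expiry) or None
    if current is not None and current[1] > now and current[0] != owner:
        return False  # someone else holds an unexpired lease
    # The CAS guards against a concurrent acquirer racing with us.
    return store.compare_and_set(resource, current, (owner, now + ttl))

store = KVStore()
got = acquire_lease(store, "lock:/leader", "node-a", ttl=10, now=100.0)
contended = acquire_lease(store, "lock:/leader", "node-b", ttl=10, now=105.0)
expired = acquire_lease(store, "lock:/leader", "node-b", ttl=10, now=111.0)
```

The expiry time is what makes this a lease rather than a lock: if node-a crashes without releasing it, node-b can still take over once the TTL has elapsed.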

操作总排序
Total ordering of operations

正如“领导者和锁”中所讨论的,当某些资源受到锁或租约保护时,您需要一个隔离令牌来防止客户端在进程暂停的情况下相互冲突。隔离令牌是一个在每次获取锁时单调递增的数字。ZooKeeper 通过对所有操作进行全序排序,并为每个操作提供单调递增的事务 ID(zxid)和版本号(cversion)来实现这一点[ 15 ]。

As discussed in “The leader and the lock”, when some resource is protected by a lock or lease, you need a fencing token to prevent clients from conflicting with each other in the case of a process pause. The fencing token is some number that monotonically increases every time the lock is acquired. ZooKeeper provides this by totally ordering all operations and giving each operation a monotonically increasing transaction ID (zxid) and version number (cversion) [15].
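The fencing idea can be sketched as follows. The class and field names are invented for illustration; the token plays the role that ZooKeeper's zxid can play in practice, and the storage service is assumed to check it:

```python
# Sketch of fencing: the lock service hands out a monotonically
# increasing token with each acquisition, and the storage service
# rejects any write carrying a token older than one it has already seen.

class LockService:
    def __init__(self):
        self.token = 0

    def acquire(self):
        self.token += 1   # monotonically increasing fencing token
        return self.token

class Storage:
    def __init__(self):
        self.max_token_seen = 0
        self.writes = []

    def write(self, token, data):
        if token < self.max_token_seen:
            return False  # stale client, e.g. one that was paused by GC
        self.max_token_seen = token
        self.writes.append(data)
        return True

locks, storage = LockService(), Storage()
t_old = locks.acquire()          # client 1 gets token 1, then pauses
t_new = locks.acquire()          # lease expires; client 2 gets token 2
assert storage.write(t_new, "from client 2")
stale_rejected = not storage.write(t_old, "from client 1")  # fenced off
```

The key point is that fencing requires cooperation from the storage service: the lock service alone cannot stop a paused client that still believes it holds the lease.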

故障检测
Failure detection

客户端在 ZooKeeper 服务器上维护一个长期会话,并且客户端和服务器定期交换心跳以检查另一个节点是否仍然活动。即使连接暂时中断,或者 ZooKeeper 节点发生故障,会话仍保持活动状态。但是,如果心跳停止的持续时间长于会话超时,ZooKeeper 会声明会话已终止。会话持有的任何锁都可以配置为在会话超时时自动释放(ZooKeeper 将这些锁称为临时节点)。

Clients maintain a long-lived session on ZooKeeper servers, and the client and server periodically exchange heartbeats to check that the other node is still alive. Even if the connection is temporarily interrupted, or a ZooKeeper node fails, the session remains active. However, if the heartbeats cease for a duration that is longer than the session timeout, ZooKeeper declares the session to be dead. Any locks held by a session can be configured to be automatically released when the session times out (ZooKeeper calls these ephemeral nodes).

更改通知
Change notifications

一个客户端不仅可以读取另一客户端创建的锁和值,还可以监视它们的更改。因此,客户端可以发现另一个客户端何时加入集群(基于它写入 ZooKeeper 的值),或者另一个客户端是否失败(因为其会话超时且其临时节点消失)。通过订阅通知,客户端可以避免频繁轮询以了解更改。

Not only can one client read locks and values that were created by another client, but it can also watch them for changes. Thus, a client can find out when another client joins the cluster (based on the value it writes to ZooKeeper), or if another client fails (because its session times out and its ephemeral nodes disappear). By subscribing to notifications, a client avoids having to frequently poll to find out about changes.

在这些功能中,只有线性化原子操作真正需要共识。然而,正是这些功能的组合使得 ZooKeeper 这样的系统对于分布式协调非常有用。

Of these features, only the linearizable atomic operations really require consensus. However, it is the combination of these features that makes systems like ZooKeeper so useful for distributed coordination.

将工作分配给节点

Allocating work to nodes

ZooKeeper/Chubby 模型运行良好的一个示例是,如果您有一个进程或服务的多个实例,并且需要选择其中一个作为领导者或主要实例。如果领导者失败,其他节点之一应该接管。这对于单领导数据库当然有用,但对于作业调度程序和类似的有状态系统也很有用。

One example in which the ZooKeeper/Chubby model works well is if you have several instances of a process or service, and one of them needs to be chosen as leader or primary. If the leader fails, one of the other nodes should take over. This is of course useful for single-leader databases, but it’s also useful for job schedulers and similar stateful systems.

当您有一些分区资源(数据库、消息流、文件存储、分布式参与者系统等)并且需要决定将哪个分区分配给哪个节点时,就会出现另一个示例。当新节点加入集群时,一些分区需要从现有节点移动到新节点,以重新平衡负载(请参阅“重新平衡分区”)。当节点被删除或发生故障时,其他节点需要接管故障节点的工作。

Another example arises when you have some partitioned resource (database, message streams, file storage, distributed actor system, etc.) and need to decide which partition to assign to which node. As new nodes join the cluster, some of the partitions need to be moved from existing nodes to the new nodes in order to rebalance the load (see “Rebalancing Partitions”). As nodes are removed or fail, other nodes need to take over the failed nodes’ work.

通过审慎地使用 ZooKeeper 中的原子操作、临时节点和通知,就可以实现这类任务。如果操作正确,这种方法可以让应用程序在无需人工干预的情况下自动从故障中恢复。即便出现了 Apache Curator [ 17 ] 这类在 ZooKeeper 客户端 API 之上提供更高级别工具的库,这件事仍然不容易——但它还是比尝试从头实现必要的共识算法好得多,后者的成功记录很差[ 107 ]。

These kinds of tasks can be achieved by judicious use of atomic operations, ephemeral nodes, and notifications in ZooKeeper. If done correctly, this approach allows the application to automatically recover from faults without human intervention. It’s not easy, despite the appearance of libraries such as Apache Curator [17] that have sprung up to provide higher-level tools on top of the ZooKeeper client API—but it is still much better than attempting to implement the necessary consensus algorithms from scratch, which has a poor success record [107].

应用程序最初可能仅在单个节点上运行,但最终可能会增长到数千个节点。尝试在如此多的节点上进行多数投票将是非常低效的。相反,ZooKeeper 在固定数量的节点(通常为三个或五个)上运行,并在这些节点中执行多数投票,同时支持潜在的大量客户端。因此,ZooKeeper 提供了一种将一些协调节点的工作(共识、操作排序和故障检测)“外包”给外部服务的方法。

An application may initially run only on a single node, but eventually may grow to thousands of nodes. Trying to perform majority votes over so many nodes would be terribly inefficient. Instead, ZooKeeper runs on a fixed number of nodes (usually three or five) and performs its majority votes among those nodes while supporting a potentially large number of clients. Thus, ZooKeeper provides a way of “outsourcing” some of the work of coordinating nodes (consensus, operation ordering, and failure detection) to an external service.

通常,ZooKeeper 管理的数据类型变化相当缓慢:它表示诸如“在 10.1.1.23 上运行的节点是分区 7 的领导者”之类的信息,这些信息可能会在几分钟或几小时的时间尺度上发生变化。它不适用于存储应用程序的运行时状态,应用程序的运行时状态可能每秒更改数千甚至数百万次。如果需要将应用程序状态从一个节点复制到另一个节点,可以使用其他工具(例如 Apache BookKeeper [ 108 ])。

Normally, the kind of data managed by ZooKeeper is quite slow-changing: it represents information like “the node running on 10.1.1.23 is the leader for partition 7,” which may change on a timescale of minutes or hours. It is not intended for storing the runtime state of the application, which may change thousands or even millions of times per second. If application state needs to be replicated from one node to another, other tools (such as Apache BookKeeper [108]) can be used.

服务发现

Service discovery

ZooKeeper、etcd 和 Consul 也经常用于服务发现,即找出需要连接到哪个 IP 地址才能访问特定服务。在云数据中心环境中,虚拟机经常出现和消失,您通常无法提前知道服务的 IP 地址。相反,您可以配置您的服务,以便在它们启动时在服务注册表中注册其网络端点,然后其他服务可以在其中找到它们。

ZooKeeper, etcd, and Consul are also often used for service discovery—that is, to find out which IP address you need to connect to in order to reach a particular service. In cloud datacenter environments, where it is common for virtual machines to continually come and go, you often don’t know the IP addresses of your services ahead of time. Instead, you can configure your services such that when they start up they register their network endpoints in a service registry, where they can then be found by other services.
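上面描述的“启动时注册网络端点”的模式可以用一个极简的注册表草图来说明;`ServiceRegistry` 是为示意而假设的,真实的注册表(ZooKeeper、etcd、Consul)还会在会话或 TTL 失效时自动清除条目。

A minimal sketch of the register-on-startup pattern described above; `ServiceRegistry` is a hypothetical name, and a real registry (ZooKeeper, etcd, Consul) would also expire entries when a session or TTL lapses:

```python
# Toy service registry: services register their endpoints on startup,
# and other services look them up by name.
class ServiceRegistry:
    def __init__(self):
        self.endpoints = {}  # service name -> set of "host:port" strings

    def register(self, service, endpoint):
        self.endpoints.setdefault(service, set()).add(endpoint)

    def deregister(self, service, endpoint):
        self.endpoints.get(service, set()).discard(endpoint)

    def lookup(self, service):
        return sorted(self.endpoints.get(service, set()))

registry = ServiceRegistry()
registry.register("orders", "10.1.1.23:8080")
registry.register("orders", "10.1.1.24:8080")
assert registry.lookup("orders") == ["10.1.1.23:8080", "10.1.1.24:8080"]
registry.deregister("orders", "10.1.1.23:8080")
assert registry.lookup("orders") == ["10.1.1.24:8080"]
```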

然而,尚不清楚服务发现是否真的需要共识。DNS 是按服务名称查找 IP 地址的传统方式,它使用多层缓存来实现良好的性能和可用性。DNS 的读取绝对不是可线性化的,而且如果 DNS 查询的结果有点陈旧,通常也不会被认为有问题 [ 109 ]。更重要的是,DNS 能够可靠可用,并且对网络中断具有鲁棒性。

However, it is less clear whether service discovery actually requires consensus. DNS is the traditional way of looking up the IP address for a service name, and it uses multiple layers of caching to achieve good performance and availability. Reads from DNS are absolutely not linearizable, and it is usually not considered problematic if the results from a DNS query are a little stale [109]. It is more important that DNS is reliably available and robust to network interruptions.

尽管服务发现不需要达成共识,但领导者选举需要达成共识。因此,如果您的共识系统已经知道谁是领导者,那么使用该信息来帮助其他服务发现谁是领导者也是有意义的。为此,一些共识系统支持只读缓存副本。这些副本异步接收共识算法所有决策的日志,但不主动参与投票。因此,它们能够服务不需要线性化的读取请求。

Although service discovery does not require consensus, leader election does. Thus, if your consensus system already knows who the leader is, then it can make sense to also use that information to help other services discover who the leader is. For this purpose, some consensus systems support read-only caching replicas. These replicas asynchronously receive the log of all decisions of the consensus algorithm, but do not actively participate in voting. They are therefore able to serve read requests that do not need to be linearizable.
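上面说的只读缓存副本可以这样示意:副本异步地追赶共识日志,因此读取可能过时、但无需参与投票。`CachingReplica` 为示意而假设。

The read-only caching replica described above can be sketched like this (`CachingReplica` is a hypothetical name): the replica catches up on the consensus log asynchronously, so reads may be stale but never require a voting round:

```python
# Sketch of a read-only caching replica: it applies the log of consensus
# decisions asynchronously, serving reads that need not be linearizable.
class CachingReplica:
    def __init__(self):
        self.state = {}
        self.applied = 0  # how far into the decision log we have applied

    def apply_log(self, log):
        # Catch up on decisions made by the voting members.
        for key, value in log[self.applied:]:
            self.state[key] = value
        self.applied = len(log)

    def read(self, key):
        return self.state.get(key)  # possibly stale, not linearizable

log = [("leader:partition-7", "10.1.1.23")]
replica = CachingReplica()
replica.apply_log(log)
log.append(("leader:partition-7", "10.1.1.99"))  # new decision, not yet applied
assert replica.read("leader:partition-7") == "10.1.1.23"  # stale answer
replica.apply_log(log)
assert replica.read("leader:partition-7") == "10.1.1.99"  # caught up
```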

成员服务

Membership services

ZooKeeper 及同类系统可以被视为成员服务(membership service)这一悠久研究历史的一部分,该研究可以追溯到 20 世纪 80 年代,对于构建高度可靠的系统(例如空中交通管制)非常重要 [ 110 ]。

ZooKeeper and friends can be seen as part of a long history of research into membership services, which goes back to the 1980s and has been important for building highly reliable systems, e.g., for air traffic control [110].

成员服务确定哪些节点当前是集群中活跃的、存活的成员。正如我们在第 8 章中看到的,由于无界的网络延迟,不可能可靠地检测另一个节点是否发生了故障。但是,如果将故障检测与共识结合起来,节点就可以就哪些节点应被视为存活达成一致。

A membership service determines which nodes are currently active and live members of a cluster. As we saw throughout Chapter 8, due to unbounded network delays it’s not possible to reliably detect whether another node has failed. However, if you couple failure detection with consensus, nodes can come to an agreement about which nodes should be considered alive or not.

即使节点实际上还活着,仍然可能会被共识错误地宣告死亡。但对于系统来说,能就哪些节点构成当前成员达成一致仍然非常有用。例如,选择领导者可以简单地选择当前成员中编号最小的成员,但如果不同节点对当前成员是谁有不同意见,这种方法就行不通了。

It could still happen that a node is incorrectly declared dead by consensus, even though it is actually alive. But it is nevertheless very useful for a system to have agreement on which nodes constitute the current membership. For example, choosing a leader could mean simply choosing the lowest-numbered among the current members, but this approach would not work if different nodes have divergent opinions on who the current members are.
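“选编号最小的成员当领导者”之所以可行,前提恰恰是每个节点看到相同的成员集合;下面的草图示意了这一点(成员编号为假设数据):

Choosing the lowest-numbered member works only because every node sees the *same* agreed-upon membership set — a sketch of the idea above (member IDs are made up for illustration):

```python
# Once consensus has fixed the membership set, leader election can be
# as simple as picking the lowest-numbered member.
def choose_leader(agreed_members):
    # Correct only because all nodes agree on the membership set.
    return min(agreed_members)

membership = {7, 3, 12}          # agreed on via consensus
assert choose_leader(membership) == 3
membership.discard(3)            # consensus declares node 3 dead
assert choose_leader(membership) == 7
```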

小结

Summary

在本章中,我们从几个不同的角度研究了一致性和共识的主题。我们深入研究了线性化,这是一种流行的一致性模型:它的目标是使复制的数据看起来好像只有一个副本,并使所有操作以原子方式对其进行操作。尽管线性化很有吸引力,因为它很容易理解——它使数据库的行为就像单线程程序中的变量一样——但它的缺点是速度慢,尤其是在网络延迟较大的环境中。

In this chapter we examined the topics of consistency and consensus from several different angles. We looked in depth at linearizability, a popular consistency model: its goal is to make replicated data appear as though there were only a single copy, and to make all operations act on it atomically. Although linearizability is appealing because it is easy to understand—it makes a database behave like a variable in a single-threaded program—it has the downside of being slow, especially in environments with large network delays.

我们还探索了因果关系,它对系统中的事件强加了顺序(根据因果关系,先发生什么,再发生什么)。与线性化将所有操作放在一个完全有序的时间线中不同,因果关系为我们提供了一个较弱的一致性模型:有些事情可以并发,因此版本历史就像一个具有分支和合并的时间线。因果一致性没有线性化的协调开销,并且对网络问题不太敏感。

We also explored causality, which imposes an ordering on events in a system (what happened before what, based on cause and effect). Unlike linearizability, which puts all operations in a single, totally ordered timeline, causality provides us with a weaker consistency model: some things can be concurrent, so the version history is like a timeline with branching and merging. Causal consistency does not have the coordination overhead of linearizability and is much less sensitive to network problems.

然而,即使我们捕获了因果排序(例如使用 Lamport 时间戳),我们也发现有些事情无法用这种方式实现:在“时间戳排序不够”中,我们考虑了确保用户名唯一、并拒绝对同一用户名的并发注册的示例。如果一个节点要接受某个注册,它需要以某种方式知道另一个节点没有正在并发地注册相同的名称。这个问题将我们引向了共识。

However, even if we capture the causal ordering (for example using Lamport timestamps), we saw that some things cannot be implemented this way: in “Timestamp ordering is not sufficient” we considered the example of ensuring that a username is unique and rejecting concurrent registrations for the same username. If one node is going to accept a registration, it needs to somehow know that another node isn’t concurrently in the process of registering the same name. This problem led us toward consensus.
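作为对照,前文提到的 Lamport 时间戳可以用几行代码示意:它能反映因果顺序,但并不能告诉一个节点“现在没有别的节点在并发注册同一个名字”。

For contrast, the Lamport timestamps mentioned above can be sketched in a few lines: they reflect causal order, but cannot tell a node that no other node is concurrently registering the same name.

```python
# Minimal Lamport clock: every local event increments the counter, and
# each message carries the sender's counter so the recipient advances
# past it. Causally later events thus get larger timestamps.
class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self):               # local event
        self.counter += 1
        return self.counter

    def receive(self, msg_time):  # message arrival
        self.counter = max(self.counter, msg_time) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
t1 = a.tick()        # event on node A
t2 = b.receive(t1)   # causally later event on node B
assert t2 > t1       # causal order is reflected in the timestamps
```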

我们看到,达成共识意味着以所有节点都同意所决定内容、且该决定不可撤销的方式来决定某件事。经过一番挖掘,事实证明,各种各样的问题实际上都可以归约为共识,并且彼此等价(从某种意义上说,如果你有其中一个问题的解决方案,就可以轻松地将其转换为其他问题的解决方案)。这类等价问题包括:

We saw that achieving consensus means deciding something in such a way that all nodes agree on what was decided, and such that the decision is irrevocable. With some digging, it turns out that a wide range of problems are actually reducible to consensus and are equivalent to each other (in the sense that if you have a solution for one of them, you can easily transform it into a solution for one of the others). Such equivalent problems include:

可线性化的比较并设置(compare-and-set)寄存器
Linearizable compare-and-set registers

寄存器需要根据其当前值是否等于操作中给出的参数,原子地决定是否设置其值。

The register needs to atomically decide whether to set its value, based on whether its current value equals the parameter given in the operation.
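这里所说的“原子地决定”可以用一个单节点草图来示意(`CasRegister` 为假设名称;真正困难的是让多个节点对这一决定达成共识,这正是本章的主题):

The “atomic decision” here can be sketched on a single node (`CasRegister` is a hypothetical name; the hard part — getting multiple nodes to agree on the decision — is exactly what this chapter is about):

```python
import threading

# Sketch of a compare-and-set register: the value is updated only if the
# current value equals the expected one; a lock makes the check-then-set
# step atomic on a single node.
class CasRegister:
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def get(self):
        with self._lock:
            return self._value

reg = CasRegister(value=None)
assert reg.compare_and_set(None, "leader-A")      # first claim wins
assert not reg.compare_and_set(None, "leader-B")  # later claim loses
assert reg.get() == "leader-A"
```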

原子事务提交
Atomic transaction commit

数据库必须决定是提交还是中止分布式事务。

A database must decide whether to commit or abort a distributed transaction.

全序广播
Total order broadcast

消息传递系统必须决定传递消息的顺序。

The messaging system must decide on the order in which to deliver messages.
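对消息顺序的“决定”可以用单个定序器(sequencer)节点来示意——它本身不容错,只是把共识问题集中到了定序器身上;`Sequencer` 为假设名称:

The “decision” about message order can be sketched with a single sequencer node — not fault-tolerant by itself, it merely concentrates the consensus problem in the sequencer (`Sequencer` is a hypothetical name):

```python
# Sketch of total order broadcast via a sequencer: it assigns consecutive
# sequence numbers, so every recipient that delivers messages in sequence
# order sees exactly the same history.
class Sequencer:
    def __init__(self):
        self.next_seq = 0
        self.log = []  # (sequence number, message)

    def broadcast(self, message):
        seq = self.next_seq
        self.next_seq += 1
        self.log.append((seq, message))
        return seq

seq = Sequencer()
seq.broadcast("set x=1")
seq.broadcast("set x=2")
# Every replica delivers the log in the same (total) order:
assert [m for _, m in seq.log] == ["set x=1", "set x=2"]
```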

锁和租约
Locks and leases

当多个客户端竞相抢夺锁或租约时,锁将决定哪一个成功获得它。

When several clients are racing to grab a lock or lease, the lock decides which one successfully acquired it.
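锁服务裁决竞争的过程可以这样示意;草图中顺带加入了第 8 章讨论过的防护令牌(fencing token),`LockService` 为假设名称:

The lock service's decision can be sketched as follows; the sketch also hands out the fencing tokens discussed in Chapter 8 (`LockService` is a hypothetical name):

```python
# Sketch of a lock service handing out leases with fencing tokens: the
# token increases monotonically, so downstream storage can reject writes
# from a client holding a stale lease.
class LockService:
    def __init__(self):
        self.holder = None
        self.token = 0

    def acquire(self, client):
        if self.holder is None:
            self.holder = client
            self.token += 1
            return self.token  # fencing token
        return None            # someone else holds the lock

    def release(self, client):
        if self.holder == client:
            self.holder = None

svc = LockService()
t1 = svc.acquire("client-1")
assert t1 == 1
assert svc.acquire("client-2") is None  # the lock decides who wins the race
svc.release("client-1")
t2 = svc.acquire("client-2")
assert t2 == 2 and t2 > t1              # a newer lease gets a larger token
```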

成员/协调服务
Membership/coordination service

给定故障检测器(例如超时),系统必须决定哪些节点是活动的,哪些节点应该被视为死亡,因为它们的会话超时。

Given a failure detector (e.g., timeouts), the system must decide which nodes are alive, and which should be considered dead because their sessions timed out.

唯一性约束
Uniqueness constraint

当多个事务同时尝试使用同一键创建冲突记录时,约束必须决定允许哪一个记录,以及哪一个因违反约束而失败。

When several transactions concurrently try to create conflicting records with the same key, the constraint must decide which one to allow and which should fail with a constraint violation.
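唯一性约束的裁决可以在单节点上这样示意(`UsernameRegistry` 为假设名称;跨节点做到这一点正需要共识):

The uniqueness decision can be sketched on a single node (`UsernameRegistry` is a hypothetical name; doing this across nodes is what requires consensus):

```python
# Sketch of a uniqueness constraint decided atomically: of two conflicting
# attempts to register the same username, exactly one succeeds.
class UsernameRegistry:
    def __init__(self):
        self.owners = {}

    def register(self, username, user_id):
        # dict.setdefault only inserts if the key is absent, so the first
        # registration wins and a conflicting one violates the constraint.
        return self.owners.setdefault(username, user_id) == user_id

reg = UsernameRegistry()
assert reg.register("alice", "user-1")      # first attempt succeeds
assert not reg.register("alice", "user-2")  # conflicting attempt rejected
```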

如果您只有一个节点,或者您愿意将决策能力分配给单个节点,那么所有这些都很简单。这就是单领导者数据库中发生的情况:所有决策权都归属于领导者,这就是为什么此类数据库能够提供线性化操作、唯一性约束、完全有序的复制日志等。

All of these are straightforward if you only have a single node, or if you are willing to assign the decision-making capability to a single node. This is what happens in a single-leader database: all the power to make decisions is vested in the leader, which is why such databases are able to provide linearizable operations, uniqueness constraints, a totally ordered replication log, and more.

然而,如果单个领导者发生故障,或者网络中断导致领导者无法访问,这样的系统将无法取得任何进展。处理这种情况有以下三种方法:

However, if that single leader fails, or if a network interruption makes the leader unreachable, such a system becomes unable to make any progress. There are three ways of handling that situation:

  1. 等待领导者恢复,并接受系统在此期间将被阻塞的事实。许多 XA/JTA 事务协调器选择此选项。这种方法并没有完全解决共识问题,因为它不满足终止(termination)属性:如果领导者不恢复,系统可能会永远被阻塞。

  1. Wait for the leader to recover, and accept that the system will be blocked in the meantime. Many XA/JTA transaction coordinators choose this option. This approach does not fully solve consensus because it does not satisfy the termination property: if the leader does not recover, the system can be blocked forever.

  2. 通过让人工选择新的领导节点、并重新配置系统以使用它来手动进行故障转移。许多关系数据库都采用这种方法。这是一种由“上帝之举”达成的共识——由计算机系统之外的人类操作员做出决定。故障转移的速度受限于人类行动的速度,通常比计算机慢。

  2. Manually fail over by getting humans to choose a new leader node and reconfigure the system to use it. Many relational databases take this approach. It is a kind of consensus by “act of God”—the human operator, outside of the computer system, makes the decision. The speed of failover is limited by the speed at which humans can act, which is generally slower than computers.

  3. 使用算法自动选择新的领导者。这种方法需要一个共识算法,并且建议使用经过验证的、能够正确处理不利网络条件的算法 [ 107 ]。

  3. Use an algorithm to automatically choose a new leader. This approach requires a consensus algorithm, and it is advisable to use a proven algorithm that correctly handles adverse network conditions [107].

尽管单领导者数据库可以在不对每次写入执行共识算法的情况下提供线性化,但它仍然需要共识来维持其领导地位以及处理领导权变更。因此,从某种意义上说,拥有一个领导者只是“把问题往后推”:仍然需要共识,只是发生在不同的地方,而且频率更低。好消息是,用于共识的容错算法和系统是存在的,我们在本章中对它们进行了简要讨论。

Although a single-leader database can provide linearizability without executing a consensus algorithm on every write, it still requires consensus to maintain its leadership and for leadership changes. Thus, in some sense, having a leader only “kicks the can down the road”: consensus is still required, only in a different place, and less frequently. The good news is that fault-tolerant algorithms and systems for consensus exist, and we briefly discussed them in this chapter.

像 ZooKeeper 这样的工具在提供应用程序可以使用的“外包”共识、故障检测和成员服务方面发挥着重要作用。它并不容易使用,但总比尝试自行开发能够承受第 8 章所讨论的所有问题的算法要好得多。如果您发现自己想要做的某件事可以归约为共识,并且希望它具有容错能力,那么建议使用 ZooKeeper 之类的工具。

Tools like ZooKeeper play an important role in providing an “outsourced” consensus, failure detection, and membership service that applications can use. It’s not easy to use, but it is much better than trying to develop your own algorithms that can withstand all the problems discussed in Chapter 8. If you find yourself wanting to do one of those things that is reducible to consensus, and you want it to be fault-tolerant, then it is advisable to use something like ZooKeeper.

然而,并不是每个系统都必然需要共识:例如,无领导者和多领导者复制系统通常不使用全局共识。这些系统中发生的冲突(请参阅“处理写入冲突”)正是不同领导者之间没有共识的结果,但也许这没关系:也许我们只需要学会在没有线性化的情况下应对,并更好地处理具有分支与合并版本历史的数据。

Nevertheless, not every system necessarily requires consensus: for example, leaderless and multi-leader replication systems typically do not use global consensus. The conflicts that occur in these systems (see “Handling Write Conflicts”) are a consequence of not having consensus across different leaders, but maybe that’s okay: maybe we simply need to cope without linearizability and learn to work better with data that has branching and merging version histories.

本章引用了大量关于分布式系统理论的研究。尽管理论论文和证明并不总是容易理解,有时还会做出不切实际的假设,但它们对于指导该领域的实际工作非常有价值:它们帮助我们推理什么可以做、什么不可以做,并帮助我们发现分布式系统中常见的那些违反直觉的缺陷。如果您有时间,这些参考资料非常值得探索。

This chapter referenced a large body of research on the theory of distributed systems. Although the theoretical papers and proofs are not always easy to understand, and sometimes make unrealistic assumptions, they are incredibly valuable for informing practical work in this field: they help us reason about what can and cannot be done, and help us find the counterintuitive ways in which distributed systems are often flawed. If you have the time, the references are well worth exploring.

这就是本书第二部分的结尾,其中我们介绍了复制(第 5 章)、分区(第 6 章)、事务(第 7 章)、分布式系统故障模型(第 8 章),以及最后的一致性和共识(第 9 章)。现在我们已经奠定了坚实的理论基础,在第三部分中,我们将再次转向更实用的系统,并讨论如何从异构构建块构建强大的应用程序。

This brings us to the end of Part II of this book, in which we covered replication (Chapter 5), partitioning (Chapter 6), transactions (Chapter 7), distributed system failure models (Chapter 8), and finally consistency and consensus (Chapter 9). Now that we have laid a firm foundation of theory, in Part III we will turn once again to more practical systems, and discuss how to build powerful applications from heterogeneous building blocks.

脚注

i图中的一个微妙细节是,它假设存在一个全局时钟,由水平轴表示。尽管真实系统通常没有准确的时钟(请参阅“不可靠的时钟”),但这一假设是可以接受的:出于分析分布式算法的目的,只要算法无法访问这个准确的全局时钟,我们就可以假装它存在 [ 47 ]。算法所能看到的,只是由石英振荡器和 NTP 产生的、对真实时间的粗糙近似。

i A subtle detail of this diagram is that it assumes the existence of a global clock, represented by the horizontal axis. Even though real systems typically don’t have accurate clocks (see “Unreliable Clocks”), this assumption is okay: for the purposes of analyzing a distributed algorithm, we may pretend that an accurate global clock exists, as long as the algorithm doesn’t have access to it [47]. Instead, the algorithm can only see a mangled approximation of real time, as produced by a quartz oscillator and NTP.

ii如果读取与写入同时进行,则读取可能返回旧值或新值的寄存器称为常规寄存器[ 7 , 25 ]。

ii A register in which reads may return either the old or the new value if they are concurrent with a write is known as a regular register [7, 25].

iii严格来说,ZooKeeper 和 etcd 提供可线性化的写入,但读取可能会过时,因为默认情况下读取可以由任何一个副本提供服务。您可以选择请求线性化读取:etcd 将此称为仲裁读取(quorum read)[ 16 ];而在 ZooKeeper 中,您需要在读取之前调用 sync() [ 15 ];请参阅“使用全序广播实现线性化存储”。

iii Strictly speaking, ZooKeeper and etcd provide linearizable writes, but reads may be stale, since by default they can be served by any one of the replicas. You can optionally request a linearizable read: etcd calls this a quorum read [16], and in ZooKeeper you need to call sync() before the read [15]; see “Implementing linearizable storage using total order broadcast”.

iv对单领导者数据库进行分区(分片),使每个分区都有一个单独的领导者,并不会影响线性化,因为线性化只是针对单个对象的保证。跨分区事务则是另一回事(参见“分布式事务与共识”)。

iv Partitioning (sharding) a single-leader database, so that there is a separate leader per partition, does not affect linearizability, since it is only a single-object guarantee. Cross-partition transactions are a different matter (see “Distributed Transactions and Consensus”).

v这两种选择有时分别称为 CP(在网络分区下一致但不可用)和 AP(在网络分区下可用但不一致)。然而,这种分类方案有几个缺陷[ 9 ],所以最好避免。

v These two choices are sometimes known as CP (consistent but not available under network partitions) and AP (available but not consistent under network partitions), respectively. However, this classification scheme has several flaws [9], so it is best avoided.

vi正如 《实践中的网络故障》中所讨论的,本书使用分区 来指故意将大型数据集分解为较小的数据集(分片;请参阅 第 6 章)。相比之下,网络分区是一种特殊类型的网络故障,我们通常不会将其与其他类型的故障分开考虑。然而,由于它是CAP中的P,我们无法避免这种情况下的混乱。

vi As discussed in “Network Faults in Practice”, this book uses partitioning to refer to deliberately breaking down a large dataset into smaller ones (sharding; see Chapter 6). By contrast, a network partition is a particular type of network fault, which we normally don’t consider separately from other kinds of faults. However, since it’s the P in CAP, we can’t avoid the confusion in this case.

vii与因果关系不一致的全序很容易创建,但不是很有用。例如,您可以为每个操作生成一个随机 UUID,并按字典顺序比较 UUID 以定义操作的总顺序。这是一个有效的全序,但随机 UUID 无法告诉您哪个操作实际上首先发生,或者这些操作是否是并发的。

vii A total order that is inconsistent with causality is easy to create, but not very useful. For example, you can generate a random UUID for each operation, and compare UUIDs lexicographically to define the total ordering of operations. This is a valid total order, but the random UUIDs tell you nothing about which operation actually happened first, or whether the operations were concurrent.

viii可以使物理时钟时间戳与因果关系一致:在“全局快照的同步时钟”中 ,我们讨论了 Google 的 Spanner,它估计预期的时钟偏差并在提交写入之前等待不确定性间隔。此方法确保为因果较晚的事务赋予更大的时间戳。然而,大多数时钟无法提供所需的不确定性度量。

viii It is possible to make physical clock timestamps consistent with causality: in “Synchronized clocks for global snapshots” we discussed Google’s Spanner, which estimates the expected clock skew and waits out the uncertainty interval before committing a write. This method ensures that a causally later transaction is given a greater timestamp. However, most clocks cannot provide the required uncertainty metric.

ix术语“原子广播”是传统叫法,但它非常令人困惑,因为它与“原子”一词的其他用法不一致:它与 ACID 事务中的原子性无关,仅与原子操作(多线程编程意义上的)或原子寄存器(可线性化存储)有间接关系。术语“全序组播(total order multicast)”是另一个同义词。

ix The term atomic broadcast is traditional, but it is very confusing as it’s inconsistent with other uses of the word atomic: it has nothing to do with atomicity in ACID transactions and is only indirectly related to atomic operations (in the sense of multi-threaded programming) or atomic registers (linearizable storage). The term total order multicast is another synonym.

x从形式上讲,可线性化的读写寄存器是一个“更简单”的问题。全序广播等价于共识 [ 67 ],而共识在异步崩溃-停止(crash-stop)模型中没有确定性解决方案 [ 68 ];相比之下,可线性化的读写寄存器可以在同一系统模型中实现 [ 23, 24, 25 ]。然而,支持诸如比较并设置或增量并获取之类的原子操作,会使寄存器等价于共识 [ 28 ]。因此,共识问题和可线性化寄存器问题密切相关。

x In a formal sense, a linearizable read-write register is an “easier” problem. Total order broadcast is equivalent to consensus [67], which has no deterministic solution in the asynchronous crash-stop model [68], whereas a linearizable read-write register can be implemented in the same system model [23, 24, 25]. However, supporting atomic operations such as compare-and-set or increment-and-get in a register makes it equivalent to consensus [28]. Thus, the problems of consensus and a linearizable register are closely related.

xi如果您不等待,而是在写入入队后立即确认写入,您将得到类似于多核 x86 处理器的内存一致性模型的结果 [43 ]。该模型既不是线性化的,也不是顺序一致的。

xi If you don’t wait, but acknowledge the write immediately after it has been enqueued, you get something similar to the memory consistency model of multi-core x86 processors [43]. That model is neither linearizable nor sequentially consistent.

xii原子提交的形式化与共识略有不同:原子事务只有在所有参与者都投票提交时才能提交,而只要有任何参与者需要中止就必须中止。共识则允许就参与者之一提出的任何值做出决定。然而,原子提交和共识可以相互归约 [ 70, 71 ]。非阻塞原子提交比共识更难——参见“三阶段提交”。

xii Atomic commit is formalized slightly differently from consensus: an atomic transaction can commit only if all participants vote to commit, and must abort if any participant needs to abort. Consensus is allowed to decide on any value that is proposed by one of the participants. However, atomic commit and consensus are reducible to each other [70, 71]. Nonblocking atomic commit is harder than consensus—see “Three-phase commit”.

xiii这种特殊的共识变体称为统一共识,相当于具有不可靠故障检测器的异步系统中的常规共识[ 71 ]。学术文献通常指的是进程而不是节点,但我们在这里使用节点是为了与本书的其余部分保持一致。

xiii This particular variant of consensus is called uniform consensus, which is equivalent to regular consensus in asynchronous systems with unreliable failure detectors [71]. The academic literature usually refers to processes rather than nodes, but we use nodes here for consistency with the rest of this book.

参考文献

[ 1 ] Peter Bailis 和 Ali Ghodsi:“当今的最终一致性:限制、扩展及超越” , ACM Queue,第 11 卷,第 3 期,第 55-63 页,2013 年 3 月 。doi:10.1145/2460276.2462076

[1] Peter Bailis and Ali Ghodsi: “Eventual Consistency Today: Limitations, Extensions, and Beyond,” ACM Queue, volume 11, number 3, pages 55-63, March 2013. doi:10.1145/2460276.2462076

[ 2 ] Prince Mahajan、Lorenzo Alvisi 和 Mike Dahlin:“一致性、可用性和融合”,德克萨斯大学奥斯汀分校计算机科学系,技术报告 UTCS TR-11-22,2011 年 5 月。

[2] Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin: “Consistency, Availability, and Convergence,” University of Texas at Austin, Department of Computer Science, Tech Report UTCS TR-11-22, May 2011.

[ 3 ] Alex Scotti:“构建您自己的数据库的冒险”,All Your Base,2015 年 11 月。

[3] Alex Scotti: “Adventures in Building Your Own Database,” at All Your Base, November 2015.

[ 4 ] Peter Bailis、Aaron Davidson、Alan Fekete 等人:“高可用性事务:优点和局限性”,第 40 届超大型数据库国际会议(VLDB),2014 年 9 月。扩展版本作为预印本 arXiv 发布:1302.0309 [cs.DB]。

[4] Peter Bailis, Aaron Davidson, Alan Fekete, et al.: “Highly Available Transactions: Virtues and Limitations,” at 40th International Conference on Very Large Data Bases (VLDB), September 2014. Extended version published as pre-print arXiv:1302.0309 [cs.DB].

[ 5 ] Paolo Viotti 和 Marko Vukolić:“非事务性分布式存储系统的一致性”,arXiv:1512.00168,2016 年 4 月 12 日。

[5] Paolo Viotti and Marko Vukolić: “Consistency in Non-Transactional Distributed Storage Systems,” arXiv:1512.00168, 12 April 2016.

[ 6 ] Maurice P. Herlihy 和 Jeannette M. Wing:“线性化:并发对象的正确性条件”,ACM Transactions on Programming Languages and Systems (TOPLAS),第 12 卷,第 3 期,第 463-492 页,1990 年 7 月。doi:10.1145/78969.78972

[6] Maurice P. Herlihy and Jeannette M. Wing: “Linearizability: A Correctness Condition for Concurrent Objects,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 12, number 3, pages 463–492, July 1990. doi:10.1145/78969.78972

[ 7 ] Leslie Lamport:“论进程间通信”,分布式计算,第 1 卷,第 2 期,第 77-101 页,1986 年 6 月。doi:10.1007/BF01786228

[7] Leslie Lamport: “On interprocess communication,” Distributed Computing, volume 1, number 2, pages 77–101, June 1986. doi:10.1007/BF01786228

[ 8 ] David K. Gifford:“分散式计算机系统中的信息存储”,施乐帕洛阿尔托研究中心,CSL-81-8,1981 年 6 月。

[8] David K. Gifford: “Information Storage in a Decentralized Computer System,” Xerox Palo Alto Research Centers, CSL-81-8, June 1981.

[ 9 ] Martin Kleppmann:“请停止调用数据库 CP 或 AP ”,martin.kleppmann.com,2015 年 5 月 11 日。

[9] Martin Kleppmann: “Please Stop Calling Databases CP or AP,” martin.kleppmann.com, May 11, 2015.

[ 10 ] Kyle Kingsbury:“ Call Me Maybe:MongoDB Stale Reads ”,aphyr.com,2015 年 4 月 20 日。

[10] Kyle Kingsbury: “Call Me Maybe: MongoDB Stale Reads,” aphyr.com, April 20, 2015.

[ 11 ] Kyle Kingsbury:“克诺索斯的计算技术”,aphyr.com,2014 年 5 月 17 日。

[11] Kyle Kingsbury: “Computational Techniques in Knossos,” aphyr.com, May 17, 2014.

[ 12 ] Peter Bailis:“线性化与串行化”,bailis.org,2014 年 9 月 24 日。

[12] Peter Bailis: “Linearizability Versus Serializability,” bailis.org, September 24, 2014.

[ 13 ] Philip A. Bernstein、Vassos Hadzilacos 和 Nathan Goodman: 数据库系统中的并发控制和恢复。Addison-Wesley,1987 年。ISBN:978-0-201-10715-9,可在Research.microsoft.com上在线获取。

[13] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at research.microsoft.com.

[ 14 ] Mike Burrows:“ The Chubby Lock Service for Loosely-Coupled Distributed Systems ”,第 7 届 USENIX 操作系统设计与实现(OSDI) 研讨会,2006 年 11 月。

[14] Mike Burrows: “The Chubby Lock Service for Loosely-Coupled Distributed Systems,” at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.

[ 15 ] Flavio P. Junqueira 和 Benjamin Reed: ZooKeeper:分布式进程协调。奥莱利媒体,2013 年。ISBN:978-1-449-36130-3

[15] Flavio P. Junqueira and Benjamin Reed: ZooKeeper: Distributed Process Coordination. O’Reilly Media, 2013. ISBN: 978-1-449-36130-3

[ 16 ]“ etcd 2.0.12 文档”,CoreOS, Inc.,2015 年。

[16] “etcd 2.0.12 Documentation,” CoreOS, Inc., 2015.

[ 17 ]“ Apache Curator ”,Apache 软件基金会,curator.apache.org,2015 年。

[17] “Apache Curator,” Apache Software Foundation, curator.apache.org, 2015.

[ 18 ] Murali Vallath: Oracle 10g RAC 网格、服务与集群。爱思唯尔数字出版社,2006 年。ISBN:978-1-555-58321-7

[18] Murali Vallath: Oracle 10g RAC Grid, Services & Clustering. Elsevier Digital Press, 2006. ISBN: 978-1-555-58321-7

[ 19 ] Peter Bailis、Alan Fekete、Michael J Franklin 等人:“避免协调的数据库系统”, VLDB Endowment 论文集,第 8 卷,第 3 期,第 185-196 页,2014 年 11 月。

[19] Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “Coordination-Avoiding Database Systems,” Proceedings of the VLDB Endowment, volume 8, number 3, pages 185–196, November 2014.

[ 20 ] Kyle Kingsbury:“ Call Me Maybe:etcd 和 Consul ”,aphyr.com,2014 年 6 月 9 日。

[20] Kyle Kingsbury: “Call Me Maybe: etcd and Consul,” aphyr.com, June 9, 2014.

[ 21 ] Flavio P. Junqueira、Benjamin C. Reed 和 Marco Serafini:“ Zab:主备份系统的高性能广播”,第 41 届 IEEE 国际可靠系统和网络会议(DSN),2011 年 6 月。doi:10.1109/DSN.2011.5958223

[21] Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini: “Zab: High-Performance Broadcast for Primary-Backup Systems,” at 41st IEEE International Conference on Dependable Systems and Networks (DSN), June 2011. doi:10.1109/DSN.2011.5958223

[ 22 ]Diego Ongaro 和 John K. Ousterhout:“寻找可理解的共识算法(扩展版本) ”,USENIX 年度技术会议 (ATC),2014 年 6 月。

[22] Diego Ongaro and John K. Ousterhout: “In Search of an Understandable Consensus Algorithm (Extended Version),” at USENIX Annual Technical Conference (ATC), June 2014.

[ 23 ] Hagit Attiya、Amotz Bar-Noy 和 Danny Dolev:“在消息传递系统中稳健地共享内存”,ACM 杂志,第 42 卷,第 1 期,第 124–142 页,1995 年 1 月 。doi:10.1145/200836.200869

[23] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev: “Sharing Memory Robustly in Message-Passing Systems,” Journal of the ACM, volume 42, number 1, pages 124–142, January 1995. doi:10.1145/200836.200869

[ 24 ] Nancy Lynch 和 Alex Shvartsman:“使用动态仲裁确认广播对共享内存进行鲁棒仿真”,第 27 届国际容错计算研讨会(FTCS),1997 年 6 月 。doi:10.1109/FTCS.1997.614100

[24] Nancy Lynch and Alex Shvartsman: “Robust Emulation of Shared Memory Using Dynamic Quorum-Acknowledged Broadcasts,” at 27th Annual International Symposium on Fault-Tolerant Computing (FTCS), June 1997. doi:10.1109/FTCS.1997.614100

[ 25 ] Christian Cachin、Rachid Guerraoui 和 Luís Rodrigues: 可靠和安全的分布式编程简介,第二版。施普林格,2011。ISBN:978-3-642-15259-7, doi:10.1007/978-3-642-15260-3

[25] Christian Cachin, Rachid Guerraoui, and Luís Rodrigues: Introduction to Reliable and Secure Distributed Programming, 2nd edition. Springer, 2011. ISBN: 978-3-642-15259-7, doi:10.1007/978-3-642-15260-3

[ 26 ] Sam Elliott、Mark Allen 和 Martin Kleppmann: 个人交流, twitter.com上的帖子,2015 年 10 月 15 日。

[26] Sam Elliott, Mark Allen, and Martin Kleppmann: personal communication, thread on twitter.com, October 15, 2015.

[ 27 ] Niklas Ekström、Mikhail Panchenko 和 Jonathan Ellis:“读修复可能存在问题?”,cassandra-dev邮件列表上的电子邮件主题,2012 年 10 月。

[27] Niklas Ekström, Mikhail Panchenko, and Jonathan Ellis: “Possible Issue with Read Repair?,” email thread on cassandra-dev mailing list, October 2012.

[ 28 ] Maurice P. Herlihy:“无等待同步”, ACM Transactions on Programming Languages and Systems (TOPLAS),第 13 卷,第 1 期,第 124–149 页,1991 年 1 月 。doi:10.1145/114005.102808

[28] Maurice P. Herlihy: “Wait-Free Synchronization,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 13, number 1, pages 124–149, January 1991. doi:10.1145/114005.102808

[ 29 ]Armando Fox 和 Eric A. Brewer:“ Harvest、Yield 和 Scalable Tolerant Systems ”,第 7 届操作系统热门主题研讨会(HotOS),1999 年 3 月 。doi:10.1109/HOTOS.1999.798396

[29] Armando Fox and Eric A. Brewer: “Harvest, Yield, and Scalable Tolerant Systems,” at 7th Workshop on Hot Topics in Operating Systems (HotOS), March 1999. doi:10.1109/HOTOS.1999.798396

[ 30 ] Seth Gilbert 和 Nancy Lynch:“ Brewer 猜想和一致、可用、分区容忍 Web 服务的可行性”, ACM SIGACT 新闻,第 33 卷,第 2 期,第 51-59 页,2002 年 6 月 。doi:10.1145/564585.564601

[30] Seth Gilbert and Nancy Lynch: “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services,” ACM SIGACT News, volume 33, number 2, pages 51–59, June 2002. doi:10.1145/564585.564601

[ 31 ] Seth Gilbert 和 Nancy Lynch:“对 CAP 定理的看法”,IEEE 计算机杂志,第 45 卷,第 2 期,第 30-36 页,2012 年 2 月 。doi:10.1109/MC.2011.389

[31] Seth Gilbert and Nancy Lynch: “Perspectives on the CAP Theorem,” IEEE Computer Magazine, volume 45, number 2, pages 30–36, February 2012. doi:10.1109/MC.2011.389

[ 32 ] Eric A. Brewer:“ CAP 十二年后:‘规则’如何改变”,IEEE 计算机杂志,第 45 卷,第 2 期,第 23-29 页,2012 年 2 月 。doi:10.1109/MC.2012.37

[32] Eric A. Brewer: “CAP Twelve Years Later: How the ‘Rules’ Have Changed,” IEEE Computer Magazine, volume 45, number 2, pages 23–29, February 2012. doi:10.1109/MC.2012.37

[ 33 ] Susan B. Davidson、Hector Garcia-Molina 和 Dale Skeen:“分区网络的一致性”,ACM 计算调查,第 17 卷,第 3 期,第 341–370 页,1985 年 9 月 。doi:10.1145/5505.5508

[33] Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen: “Consistency in Partitioned Networks,” ACM Computing Surveys, volume 17, number 3, pages 341–370, September 1985. doi:10.1145/5505.5508

[ 34 ] Paul R. Johnson 和 Robert H. Thomas:“ RFC 677:重复数据库的维护”,网络工作组,1975 年 1 月 27 日。

[34] Paul R. Johnson and Robert H. Thomas: “RFC 677: The Maintenance of Duplicate Databases,” Network Working Group, January 27, 1975.

[ 35 ] Bruce G. Lindsay、Patricia Griffiths Selinger、C. Galtieri 等人:“分布式数据库注释”,IBM Research,研究报告 RJ2571(33471),1979 年 7 月。

[35] Bruce G. Lindsay, Patricia Griffiths Selinger, C. Galtieri, et al.: “Notes on Distributed Databases,” IBM Research, Research Report RJ2571(33471), July 1979.

[ 36 ] Michael J. Fischer 和 Alan Michael:“牺牲可串行性以在不可靠的网络中获得数据的高可用性”, 第一届 ACM 数据库系统原理研讨会(PODS),1982 年 3 月 。doi:10.1145/588111.588124

[36] Michael J. Fischer and Alan Michael: “Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network,” at 1st ACM Symposium on Principles of Database Systems (PODS), March 1982. doi:10.1145/588111.588124

[ 37 ] Eric A. Brewer:“ NoSQL:过去、现在、未来”,旧金山 QCon,2012 年 11 月。

[37] Eric A. Brewer: “NoSQL: Past, Present, Future,” at QCon San Francisco, November 2012.

[ 38 ] Henry Robinson:“ CAP 混乱:‘分区容错’问题”,blog.cloudera.com,2010 年 4 月 26 日。

[38] Henry Robinson: “CAP Confusion: Problems with ‘Partition Tolerance,’blog.cloudera.com, April 26, 2010.

[ 39 ] Adrian Cockcroft:“迁移到微服务”,伦敦 QCon,2014 年 3 月。

[39] Adrian Cockcroft: “Migrating to Microservices,” at QCon London, March 2014.

[ 40 ] Martin Kleppmann:“对 CAP 定理的批判”,arXiv:1509.05393,2015 年 9 月 17 日。

[40] Martin Kleppmann: “A Critique of the CAP Theorem,” arXiv:1509.05393, September 17, 2015.

[ 41 ] Nancy A. Lynch:“分布式计算的一百个不可能性证明”,第 8 届 ACM 分布式计算原理研讨会(PODC),1989 年 8 月 。doi:10.1145/72981.72982

[41] Nancy A. Lynch: “A Hundred Impossibility Proofs for Distributed Computing,” at 8th ACM Symposium on Principles of Distributed Computing (PODC), August 1989. doi:10.1145/72981.72982

[ 42 ] Hagit Attiya、Faith Ellen 和 Adam Morrison:“高可用性最终一致数据存储的局限性”,ACM 分布式计算原理研讨会(PODC),2015 年 7 月 。doi:10.1145/2767386.2767419

[42] Hagit Attiya, Faith Ellen, and Adam Morrison: “Limitations of Highly-Available Eventually-Consistent Data Stores,” at ACM Symposium on Principles of Distributed Computing (PODC), July 2015. doi:10.1145/2767386.2767419

[ 43 ] Peter Sewell、Susmit Sarkar、Scott Owens 等人:“ x86-TSO:针对 x86 多处理器的严格且可用的程序员模型”,Communications of the ACM,第 53 卷,第 7 期,第 89-97 页,2010 年 7 月.doi :10.1145/1785414.1785443

[43] Peter Sewell, Susmit Sarkar, Scott Owens, et al.: “x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors,” Communications of the ACM, volume 53, number 7, pages 89–97, July 2010. doi:10.1145/1785414.1785443

[ 44 ] Martin Thompson:“记忆障碍/栅栏”,mechanical-sympathy.blogspot.co.uk,2011 年 7 月 24 日。

[44] Martin Thompson: “Memory Barriers/Fences,” mechanical-sympathy.blogspot.co.uk, July 24, 2011.

[ 45 ] Ulrich Drepper:“每个程序员都应该了解内存”,akkadia.org,2007 年 11 月 21 日。

[45] Ulrich Drepper: “What Every Programmer Should Know About Memory,” akkadia.org, November 21, 2007.

[ 46 ] Daniel J. Abadi:“现代分布式数据库系统设计中的一致性权衡”,IEEE 计算机杂志,第 45 卷,第 2 期,第 37-42 页,2012 年 2 月 。doi:10.1109/MC.2012.33

[46] Daniel J. Abadi: “Consistency Tradeoffs in Modern Distributed Database System Design,” IEEE Computer Magazine, volume 45, number 2, pages 37–42, February 2012. doi:10.1109/MC.2012.33

[ 47 ] Hagit Attiya 和 Jennifer L. Welch:“顺序一致性与线性化”,ACM Transactions on Computer Systems (TOCS),第 12 卷,第 2 期,第 91–122 页,1994 年 5 月 。doi:10.1145/176575.176576

[47] Hagit Attiya and Jennifer L. Welch: “Sequential Consistency Versus Linearizability,” ACM Transactions on Computer Systems (TOCS), volume 12, number 2, pages 91–122, May 1994. doi:10.1145/176575.176576

[ 48 ] Mustaque Ahamad、Gil Neiger、James E. Burns 等人:“因果记忆:定义、实现和编程”,分布式计算,第 9 卷,第 1 期,第 37-49 页,1995 年 3 月。doi:10.1007/BF01784241

[48] Mustaque Ahamad, Gil Neiger, James E. Burns, et al.: “Causal Memory: Definitions, Implementation, and Programming,” Distributed Computing, volume 9, number 1, pages 37–49, March 1995. doi:10.1007/BF01784241

[ 49 ] Wyatt Lloyd、Michael J. Freedman、Michael Kaminsky 和 David G. Andersen:“更强的低延迟地理复制存储语义”,第 10 届 USENIX 网络系统设计和实现(NSDI)研讨会,2013 年 4 月。

[49] Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen: “Stronger Semantics for Low-Latency Geo-Replicated Storage,” at 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2013.

[ 50 ] Marek Zawirski、Annette Bieniusa、Valter Balegas 等人:“ SwiftCloud:容错地理复制一直集成到客户端计算机”,INRIA 研究报告 8347,2013 年 8 月。

[50] Marek Zawirski, Annette Bieniusa, Valter Balegas, et al.: “SwiftCloud: Fault-Tolerant Geo-Replication Integrated All the Way to the Client Machine,” INRIA Research Report 8347, August 2013.

[ 51 ] Peter Bailis、Ali Ghodsi、Joseph M Hellerstein 和 Ion Stoica:“ Bolt-on 因果一致性”, ACM 国际数据管理会议(SIGMOD),2013 年 6 月。

[51] Peter Bailis, Ali Ghodsi, Joseph M Hellerstein, and Ion Stoica: “Bolt-on Causal Consistency,” at ACM International Conference on Management of Data (SIGMOD), June 2013.

[ 52 ] Philippe Ajoux、Nathan Bronson、Sanjeev Kumar 等人:“大规模采用更强一致性的挑战”,第 15 届 USENIX 操作系统热门主题研讨会(HotOS),2015 年 5 月。

[52] Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “Challenges to Adopting Stronger Consistency at Scale,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

[ 53 ] Peter Bailis:“因果关系是昂贵的(以及如何处理它) ”,bailis.org,2014 年 2 月 5 日。

[53] Peter Bailis: “Causality Is Expensive (and What to Do About It),” bailis.org, February 5, 2014.

[ 54 ] Ricardo Gonçalves、Paulo Sérgio Almeida、Carlos Baquero 和 Victor Fonte:“面向最终一致数据存储的简洁服务器级因果关系管理”,第 15 届 IFIP 国际分布式应用程序和互操作系统(DAIS)会议,2015 年 6 月。doi:10.1007/978-3-319-19129-4_6

[54] Ricardo Gonçalves, Paulo Sérgio Almeida, Carlos Baquero, and Victor Fonte: “Concise Server-Wide Causality Management for Eventually Consistent Data Stores,” at 15th IFIP International Conference on Distributed Applications and Interoperable Systems (DAIS), June 2015. doi:10.1007/978-3-319-19129-4_6

[ 55 ] Rob Conery:“更好的 PostgreSQL ID 生成器”,rob.conery.io,2014 年 5 月 29 日。

[55] Rob Conery: “A Better ID Generator for PostgreSQL,” rob.conery.io, May 29, 2014.

[ 56 ] Leslie Lamport:“分布式系统中的时间、时钟和事件顺序”,ACM 通讯,第 21 卷,第 7 期,第 558–565 页,1978 年 7 月 。doi:10.1145/359545.359563

[56] Leslie Lamport: “Time, Clocks, and the Ordering of Events in a Distributed System,” Communications of the ACM, volume 21, number 7, pages 558–565, July 1978. doi:10.1145/359545.359563

[ 57 ] Xavier Défago、André Schiper 和 Péter Urbán:“全序广播和组播算法:分类和调查”,ACM 计算调查,第 36 卷,第 4 期,第 372–421 页,2004 年 12 月 。doi:10.1145/1041680.1041682

[57] Xavier Défago, André Schiper, and Péter Urbán: “Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey,” ACM Computing Surveys, volume 36, number 4, pages 372–421, December 2004. doi:10.1145/1041680.1041682

[ 58 ] Hagit Attiya 和 Jennifer Welch:分布式计算:基础知识、模拟和高级主题,第二版。约翰·威利父子公司,2004 年。ISBN:978-0-471-45324-6, doi:10.1002/0471478210

[58] Hagit Attiya and Jennifer Welch: Distributed Computing: Fundamentals, Simulations and Advanced Topics, 2nd edition. John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6, doi:10.1002/0471478210

[ 59 ] Mahesh Balakrishnan、Dahlia Malkhi、Vijayan Prabhakaran 等人:“ CORFU:闪存集群的共享日志设计”,第 9 届 USENIX 网络系统设计和实现(NSDI) 研讨会,2012 年 4 月。

[59] Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, et al.: “CORFU: A Shared Log Design for Flash Clusters,” at 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2012.

[ 60 ] Fred B. Schneider:“使用状态机方法实现容错服务:教程”,ACM 计算调查,第 22 卷,第 4 期,第 299-319 页,1990 年 12 月。

[60] Fred B. Schneider: “Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial,” ACM Computing Surveys, volume 22, number 4, pages 299–319, December 1990.

[ 61 ] Alexander Thomson、Thaddeus Diamond、Shu-Chun Weng 等人:“ Calvin:分区数据库系统的快速分布式事务”,ACM 国际数据管理会议(SIGMOD),2012 年 5 月。

[61] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, et al.: “Calvin: Fast Distributed Transactions for Partitioned Database Systems,” at ACM International Conference on Management of Data (SIGMOD), May 2012.

[ 62 ] Mahesh Balakrishnan、Dahlia Malkhi、Ted Wobber 等人:“ Tango:基于共享日志的分布式数据结构”,第 24 届 ACM 操作系统原理研讨会(SOSP),2013 年 11 月 。doi:10.1145/2517349.2522732

[62] Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, et al.: “Tango: Distributed Data Structures over a Shared Log,” at 24th ACM Symposium on Operating Systems Principles (SOSP), November 2013. doi:10.1145/2517349.2522732

[ 63 ] Robbert van Renesse 和 Fred B. Schneider:“支持高吞吐量和可用性的链复制”,第六届 USENIX 操作系统设计和实现(OSDI) 研讨会,2004 年 12 月。

[63] Robbert van Renesse and Fred B. Schneider: “Chain Replication for Supporting High Throughput and Availability,” at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.

[ 64 ] Leslie Lamport:“如何制作正确执行多进程程序的多处理器计算机”,IEEE Transactions on Computers,第 28 卷,第 9 期,第 690–691 页,1979 年 9 月 。doi:10.1109/TC.1979.1675439

[64] Leslie Lamport: “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs,” IEEE Transactions on Computers, volume 28, number 9, pages 690–691, September 1979. doi:10.1109/TC.1979.1675439

[ 65 ] Enis Söztutar、Devaraj Das 和 Carter Shanklin:“ Apache HBase 高可用性更上一层楼”,hortonworks.com,2015 年 1 月 22 日。

[65] Enis Söztutar, Devaraj Das, and Carter Shanklin: “Apache HBase High Availability at the Next Level,” hortonworks.com, January 22, 2015.

[ 66 ] Brian F. Cooper、Raghu Ramakrishnan、Utkarsh Srivastava 等人:“PNUTS:Yahoo! 的托管数据服务平台”,第 34 届超大型数据库国际会议(VLDB),2008 年 8 月。doi:10.14778/1454159.1454167

[66] Brian F Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, et al.: “PNUTS: Yahoo!’s Hosted Data Serving Platform,” at 34th International Conference on Very Large Data Bases (VLDB), August 2008. doi:10.14778/1454159.1454167

[ 67 ] Tushar Deepak Chandra 和 Sam Toueg:“可靠分布式系统的不可靠故障检测器”,ACM 杂志,第 43 卷,第 2 期,第 225–267 页,1996 年 3 月 。doi:10.1145/226643.226647

[67] Tushar Deepak Chandra and Sam Toueg: “Unreliable Failure Detectors for Reliable Distributed Systems,” Journal of the ACM, volume 43, number 2, pages 225–267, March 1996. doi:10.1145/226643.226647

[ 68 ] Michael J. Fischer、Nancy Lynch 和 Michael S. Paterson:“通过一个故障进程实现分布式共识的不可能性”,ACM 杂志,第 32 卷,第 2 期,第 374–382 页,1985 年 4 月。doi:10.1145/3149.214121

[68] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson: “Impossibility of Distributed Consensus with One Faulty Process,” Journal of the ACM, volume 32, number 2, pages 374–382, April 1985. doi:10.1145/3149.214121

[ 69 ] Michael Ben-Or:“自由选择的另一个优势:完全异步协议协议”,第二届ACM 分布式计算原理研讨会(PODC),1983 年 8 月 。doi:10.1145/800221.806707

[69] Michael Ben-Or: “Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols,” at 2nd ACM Symposium on Principles of Distributed Computing (PODC), August 1983. doi:10.1145/800221.806707

[ 70 ] Jim N. Gray 和 Leslie Lamport:“事务提交共识”,ACM 数据库系统事务(TODS),第 31 卷,第 1 期,第 133–160 页,2006 年 3 月 。doi:10.1145/1132863.1132867

[70] Jim N. Gray and Leslie Lamport: “Consensus on Transaction Commit,” ACM Transactions on Database Systems (TODS), volume 31, number 1, pages 133–160, March 2006. doi:10.1145/1132863.1132867

[ 71 ] Rachid Guerraoui:“重新审视非阻塞原子承诺与共识之间的关系”,第 9 届国际分布式算法研讨会(WDAG),1995 年 9 月 。doi:10.1007/BFb0022140

[71] Rachid Guerraoui: “Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus,” at 9th International Workshop on Distributed Algorithms (WDAG), September 1995. doi:10.1007/BFb0022140

[ 72 ] Thanumalayan Sankaranarayana Pillai、Vijay Chidambaram、Ramnatthan Alagappan 等人:“并非所有文件系统都是生来平等的:论构建崩溃一致应用程序的复杂性”,第 11 届 USENIX 操作系统设计与实现研讨会(OSDI) ,2014 年 10 月。

[72] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, et al.: “All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.

[ 73 ] Jim Gray:“事务概念:优点和局限性”,第 7 届超大型数据库国际会议(VLDB),1981 年 9 月。

[73] Jim Gray: “The Transaction Concept: Virtues and Limitations,” at 7th International Conference on Very Large Data Bases (VLDB), September 1981.

[ 74 ] Hector Garcia-Molina 和 Kenneth Salem:“ Sagas ”, ACM 国际数据管理会议(SIGMOD),1987 年 5 月 。doi:10.1145/38713.38742

[74] Hector Garcia-Molina and Kenneth Salem: “Sagas,” at ACM International Conference on Management of Data (SIGMOD), May 1987. doi:10.1145/38713.38742

[ 75 ] C. Mohan、Bruce G. Lindsay 和 Ron Obermarck:“R* 分布式数据库管理系统中的事务管理”,ACM 数据库系统事务,第 11 卷,第 4 期,第 378–396 页,1986 年 12 月。doi:10.1145/7239.7266

[75] C. Mohan, Bruce G. Lindsay, and Ron Obermarck: “Transaction Management in the R* Distributed Database Management System,” ACM Transactions on Database Systems, volume 11, number 4, pages 378–396, December 1986. doi:10.1145/7239.7266

[ 76 ]“分布式事务处理:XA 规范”,X/Open Company Ltd.,技术标准 XO/CAE/91/300,1991 年 12 月。ISBN:978-1-872-63024-3

[76] “Distributed Transaction Processing: The XA Specification,” X/Open Company Ltd., Technical Standard XO/CAE/91/300, December 1991. ISBN: 978-1-872-63024-3

[ 77 ] Mike Spille:“ XA 暴露,第二部分”, jroller.com,2004 年 4 月 3 日。

[77] Mike Spille: “XA Exposed, Part II,” jroller.com, April 3, 2004.

[ 78 ] Ivan Silva Neto 和 Francisco Reverbel:“ Lessons Learned from Implementing WS-Coordination and WS-AtomicTransaction ”,第 7 届 IEEE/ACIS 国际计算机和信息科学会议(ICIS),2008 年 5 月 。doi:10.1109/ICIS.2008.75

[78] Ivan Silva Neto and Francisco Reverbel: “Lessons Learned from Implementing WS-Coordination and WS-AtomicTransaction,” at 7th IEEE/ACIS International Conference on Computer and Information Science (ICIS), May 2008. doi:10.1109/ICIS.2008.75

[ 79 ] James E. Johnson、David E. Langworthy、Leslie Lamport 和 Friedrich H. Vogt:“Web 服务协议的形式化规范”,第一届 Web 服务和形式化方法国际研讨会(WS-FM),2004 年 2 月。doi:10.1016/j.entcs.2004.02.022

[79] James E. Johnson, David E. Langworthy, Leslie Lamport, and Friedrich H. Vogt: “Formal Specification of a Web Services Protocol,” at 1st International Workshop on Web Services and Formal Methods (WS-FM), February 2004. doi:10.1016/j.entcs.2004.02.022

[ 80 ] Dale Skeen:“非阻塞提交协议”,ACM 国际数据管理会议(SIGMOD),1981 年 4 月 。doi:10.1145/582318.582339

[80] Dale Skeen: “Nonblocking Commit Protocols,” at ACM International Conference on Management of Data (SIGMOD), April 1981. doi:10.1145/582318.582339

[ 81 ] Gregor Hohpe:“你的咖啡店不使用两阶段提交”,IEEE Software,第 22 卷,第 2 期,第 64–66 页,2005 年 3 月 。doi:10.1109/MS.2005.52

[81] Gregor Hohpe: “Your Coffee Shop Doesn’t Use Two-Phase Commit,” IEEE Software, volume 22, number 2, pages 64–66, March 2005. doi:10.1109/MS.2005.52

[ 82 ] Pat Helland:“超越分布式交易的生活:叛教者的观点”,第三届创新数据系统研究双年度会议(CIDR),2007 年 1 月。

[82] Pat Helland: “Life Beyond Distributed Transactions: An Apostate’s Opinion,” at 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 2007.

[ 83 ] Jonathan Oliver:“我对 MSDTC 和两阶段提交的不满”,blog.jonathanoliver.com,2011 年 4 月 4 日。

[83] Jonathan Oliver: “My Beef with MSDTC and Two-Phase Commits,” blog.jonathanoliver.com, April 4, 2011.

[ 84 ] Oren Eini (Ahende Rahien):“分布式事务的谬误”,ayende.com,2014 年 7 月 17 日。

[84] Oren Eini (Ahende Rahien): “The Fallacy of Distributed Transactions,” ayende.com, July 17, 2014.

[ 85 ] Clemens Vasters:“ Windows Azure 中的事务(使用服务总线)——电子邮件讨论”,vasters.com,2012 年 7 月 30 日。

[85] Clemens Vasters: “Transactions in Windows Azure (with Service Bus) – An Email Discussion,” vasters.com, July 30, 2012.

[ 86 ]“了解 Azure 中的事务性”,NServiceBus 文档,特定软件,2015 年。

[86] “Understanding Transactionality in Azure,” NServiceBus Documentation, Particular Software, 2015.

[ 87 ] Randy Wigginton、Ryan Lowe、Marcos Albe 和 Fernando Ipar:“ MySQL 中的分布式事务”,MySQL 会议和博览会,2013 年 4 月。

[87] Randy Wigginton, Ryan Lowe, Marcos Albe, and Fernando Ipar: “Distributed Transactions in MySQL,” at MySQL Conference and Expo, April 2013.

[ 88 ] Mike Spille:“ XA 暴露,第一部分”, jroller.com,2004 年 4 月 3 日。

[88] Mike Spille: “XA Exposed, Part I,” jroller.com, April 3, 2004.

[ 89 ] Ajmer Dhariwal:“孤立的 MSDTC 交易 (-2 spid) ”,eraofdata.com,2008 年 12 月 12 日。

[89] Ajmer Dhariwal: “Orphaned MSDTC Transactions (-2 spids),” eraofdata.com, December 12, 2008.

[ 90 ] Paul Randal:“ DBCC PAGE 拯救世界的真实故事”,sqlskills.com,2013 年 6 月 19 日。

[90] Paul Randal: “Real World Story of DBCC PAGE Saving the Day,” sqlskills.com, June 19, 2013.

[ 91 ]“不确定的 xact 解析服务器配置选项”,SQL Server 2016 文档,Microsoft, Inc.,2016。

[91] “in-doubt xact resolution Server Configuration Option,” SQL Server 2016 documentation, Microsoft, Inc., 2016.

[ 92 ] Cynthia Dwork、Nancy Lynch 和 Larry Stockmeyer:“ Consensus in the Presence of Partial Synchrony ”,Journal of the ACM,第 35 卷,第 2 期,第 288–323 页,1988 年 4 月。doi:10.1145/42282.42283

[92] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: “Consensus in the Presence of Partial Synchrony,” Journal of the ACM, volume 35, number 2, pages 288–323, April 1988. doi:10.1145/42282.42283

[ 93 ] Miguel Castro 和 Barbara H. Liskov:“实用拜占庭容错和主动恢复”,ACM Transactions on Computer Systems,第 20 卷,第 4 期,第 396–461 页,2002 年 11 月 。doi:10.1145/571637.571640

[93] Miguel Castro and Barbara H. Liskov: “Practical Byzantine Fault Tolerance and Proactive Recovery,” ACM Transactions on Computer Systems, volume 20, number 4, pages 396–461, November 2002. doi:10.1145/571637.571640

[ 94 ] Brian M. Oki 和 Barbara H. Liskov:“Viewstamped Replication:一种支持高可用性分布式系统的新主副本方法”,第 7 届 ACM 分布式计算原理研讨会(PODC),1988 年 8 月。doi:10.1145/62546.62549

[94] Brian M. Oki and Barbara H. Liskov: “Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems,” at 7th ACM Symposium on Principles of Distributed Computing (PODC), August 1988. doi:10.1145/62546.62549

[ 95 ] Barbara H. Liskov 和 James Cowling:“ Viewstamped Replication Revisited ”,麻省理工学院,技术报告 MIT-CSAIL-TR-2012-021,2012 年 7 月。

[95] Barbara H. Liskov and James Cowling: “Viewstamped Replication Revisited,” Massachusetts Institute of Technology, Tech Report MIT-CSAIL-TR-2012-021, July 2012.

[ 96 ] Leslie Lamport:“兼职议会”,ACM Transactions on Computer Systems,第 16 卷,第 2 期,第 133–169 页,1998 年 5 月 。doi:10.1145/279227.279229

[96] Leslie Lamport: “The Part-Time Parliament,” ACM Transactions on Computer Systems, volume 16, number 2, pages 133–169, May 1998. doi:10.1145/279227.279229

[ 97 ] Leslie Lamport:“ Paxos Made Simple ”,ACM SIGACT News,第 32 卷,第 4 期,第 51-58 页,2001 年 12 月。

[97] Leslie Lamport: “Paxos Made Simple,” ACM SIGACT News, volume 32, number 4, pages 51–58, December 2001.

[ 98 ] Tushar Deepak Chandra、Robert Griesemer 和 Joshua Redstone:“ Paxos Made Live – An Engineering Perspective ”,第26 届 ACM 分布式计算原理研讨会(PODC),2007 年 6 月。

[98] Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone: “Paxos Made Live – An Engineering Perspective,” at 26th ACM Symposium on Principles of Distributed Computing (PODC), June 2007.

[ 99 ] Robbert van Renesse:“ Paxos 变得相当复杂”,cs.cornell.edu,2011 年 3 月。

[99] Robbert van Renesse: “Paxos Made Moderately Complex,” cs.cornell.edu, March 2011.

[ 100 ]迭戈·翁加罗(Diego Ongaro):“共识:理论与实践的桥梁”,博士论文,斯坦福大学,2014 年 8 月。

[100] Diego Ongaro: “Consensus: Bridging Theory and Practice,” PhD Thesis, Stanford University, August 2014.

[ 101 ] Heidi Howard、Malte Schwarzkopf、Anil Madhavapeddy 和 Jon Crowcroft:“Raft Refloated:我们达成共识了吗?”,《ACM SIGOPS 操作系统评论》,第 49 卷,第 1 期,第 12–21 页,2015 年 1 月。doi:10.1145/2723872.2723876

[101] Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy, and Jon Crowcroft: “Raft Refloated: Do We Have Consensus?,” ACM SIGOPS Operating Systems Review, volume 49, number 1, pages 12–21, January 2015. doi:10.1145/2723872.2723876

[ 102 ] André Medeiros:“ ZooKeeper 的原子广播协议:理论与实践”,阿尔托大学理学院,2012 年 3 月 20 日。

[102] André Medeiros: “ZooKeeper’s Atomic Broadcast Protocol: Theory and Practice,” Aalto University School of Science, March 20, 2012.

[ 103 ] Robbert van Renesse、Nicolas Schiper 和 Fred B. Schneider:“ Vive La Différence:Paxos、Viewstamped Replication 与 Zab ”,《IEEE Transactions on Dependable and SecureComputing》,第 12 卷,第 4 期,第 472-484 页, 2014 年 9 月 。doi:10.1109/TDSC.2014.2355848

[103] Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider: “Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab,” IEEE Transactions on Dependable and Secure Computing, volume 12, number 4, pages 472–484, September 2014. doi:10.1109/TDSC.2014.2355848

[ 104 ] Will Portnoy:“实施 Paxos 的经验教训”,blog.willportnoy.com,2012 年 6 月 14 日。

[104] Will Portnoy: “Lessons Learned from Implementing Paxos,” blog.willportnoy.com, June 14, 2012.

[ 105 ] Heidi Howard、Dahlia Malkhi 和 Alexander Spiegelman:“灵活的 Paxos:重新审视 Quorum 交叉点”, arXiv:1608.06696,2016年 8 月 24 日。

[105] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman: “Flexible Paxos: Quorum Intersection Revisited,” arXiv:1608.06696, August 24, 2016.

[ 106 ] Heidi Howard 和 Jon Crowcroft:“ Coracle:评估互联网边缘的共识”,ACM 数据通信特别兴趣小组(SIGCOMM) 年会,2015 年 8 月 。doi:10.1145/2829988.2790010

[106] Heidi Howard and Jon Crowcroft: “Coracle: Evaluating Consensus at the Internet Edge,” at Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM), August 2015. doi:10.1145/2829988.2790010

[ 107 ] Kyle Kingsbury:“ Call Me Maybe:Elasticsearch 1.5.0 ”,aphyr.com,2015 年 4 月 27 日。

[107] Kyle Kingsbury: “Call Me Maybe: Elasticsearch 1.5.0,” aphyr.com, April 27, 2015.

[ 108 ] Ivan Kelly:“ BookKeeper 教程”, github.com,2014 年 10 月。

[108] Ivan Kelly: “BookKeeper Tutorial,” github.com, October 2014.

[ 109 ] Camille Fournier:“面向持怀疑态度的架构师的共识系统”,Craft Conference,匈牙利布达佩斯,2015 年 4 月。

[109] Camille Fournier: “Consensus Systems for the Skeptical Architect,” at Craft Conference, Budapest, Hungary, April 2015.

[ 110 ] Kenneth P. Birman:“虚拟同步复制模型的历史”,《复制:理论与实践》,Springer LNCS 第 5959 卷,第 6 章,第 91–120 页,2010 年。ISBN:978-3-642-11293-5,doi:10.1007/978-3-642-11294-2_6

[110] Kenneth P. Birman: “A History of the Virtual Synchrony Replication Model,” in Replication: Theory and Practice, Springer LNCS volume 5959, chapter 6, pages 91–120, 2010. ISBN: 978-3-642-11293-5, doi:10.1007/978-3-642-11294-2_6

第三部分。派生数据

Part III. Derived Data

在本书的第一部分 和第二 部分中,我们从头开始汇总了分布式数据库的所有主要考虑因素,从磁盘上数据的布局一直到出现故障时分布式一致性的限制。然而,这一讨论假设应用程序中只有一个数据库。

In Parts I and II of this book, we assembled from the ground up all the major considerations that go into a distributed database, from the layout of data on disk all the way to the limits of distributed consistency in the presence of faults. However, this discussion assumed that there was only one database in the application.

事实上,数据系统往往更加复杂。在大型应用程序中,您通常需要能够以多种不同的方式访问和处理数据,并且没有一种数据库可以同时满足所有这些不同的需求。因此,应用程序通常使用几种不同的数据存储、索引、缓存、分析系统等的组合,并实现将数据从一个存储移动到另一个存储的机制。

In reality, data systems are often more complex. In a large application you often need to be able to access and process data in many different ways, and there is no one database that can satisfy all those different needs simultaneously. Applications thus commonly use a combination of several different datastores, indexes, caches, analytics systems, etc. and implement mechanisms for moving data from one store to another.

在本书的最后一部分中,我们将研究如何将多个不同的数据系统(它们可能具有不同的数据模型,并针对不同的访问模式进行了优化)集成到一个连贯的应用程序架构中的问题。那些声称其产品可以满足您所有需求的供应商经常忽视系统构建的这一方面。实际上,集成不同的系统是任何非平凡应用程序中需要完成的最重要的事情之一。

In this final part of the book, we will examine the issues around integrating multiple different data systems, potentially with different data models and optimized for different access patterns, into one coherent application architecture. This aspect of system-building is often overlooked by vendors who claim that their product can satisfy all your needs. In reality, integrating disparate systems is one of the most important things that needs to be done in a nontrivial application.

记录和派生数据系统

Systems of Record and Derived Data

在较高的层面上,存储和处理数据的系统可以分为两大类:

On a high level, systems that store and process data can be grouped into two broad categories:

记录系统
Systems of record

记录系统(也称为事实来源)保存数据的权威版本。当新数据进来时,例如,作为用户输入,它首先被写入这里。每个事实仅表示一次(该表示通常是标准化的)。如果另一个系统和记录系统之间存在任何差异,则记录系统中的值(根据定义)是正确的。

A system of record, also known as source of truth, holds the authoritative version of your data. When new data comes in, e.g., as user input, it is first written here. Each fact is represented exactly once (the representation is typically normalized). If there is any discrepancy between another system and the system of record, then the value in the system of record is (by definition) the correct one.

衍生数据系统
Derived data systems

派生系统中的数据是从另一个系统获取一些现有数据并以某种方式转换或处理它的结果。如果丢失了派生数据,您可以从原始来源重新创建它。一个典型的例子是缓存:如果存在,可以从缓存中提供数据,但如果缓存不包含您需要的内容,您可以回退到底层数据库。非规范化值、索引和物化视图也属于这一类。在推荐系统中,预测摘要数据通常源自使用日志。

Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present, but if the cache doesn’t contain what you need, you can fall back to the underlying database. Denormalized values, indexes, and materialized views also fall into this category. In recommendation systems, predictive summary data is often derived from usage logs.

从技术上讲,派生数据是冗余的,因为它重复了现有信息。然而,这通常对于获得良好的读取查询性能至关重要。它通常是非规范化的。您可以从一个来源导出多个不同的数据集,从而使您能够从不同的“角度”查看数据。

Technically speaking, derived data is redundant, in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. It is commonly denormalized. You can derive several different datasets from a single source, enabling you to look at the data from different “points of view.”

并非所有系统都在其架构中明确区分记录系统和派生数据,但这是一个非常有用的区分,因为它澄清了系统中的数据流:它明确了系统的哪些部分具有哪些输入和哪些输出,以及它们如何相互依赖。

Not all systems make a clear distinction between systems of record and derived data in their architecture, but it’s a very helpful distinction to make, because it clarifies the dataflow through your system: it makes explicit which parts of the system have which inputs and which outputs, and how they depend on each other.

大多数数据库、存储引擎和查询语言本质上都不是记录系统或派生系统。数据库只是一个工具:如何使用它取决于您。记录系统和派生数据系统之间的区别不取决于工具,而取决于您在应用程序中如何使用它。

Most databases, storage engines, and query languages are not inherently either a system of record or a derived system. A database is just a tool: how you use it is up to you. The distinction between system of record and derived data system depends not on the tool, but on how you use it in your application.

通过清楚哪些数据源自哪些其他数据,您可以使原本令人困惑的系统架构变得清晰。这一点将成为贯穿本书这一部分的一个贯穿主题。

By being clear about which data is derived from which other data, you can bring clarity to an otherwise confusing system architecture. This point will be a running theme throughout this part of the book.

章节概述

Overview of Chapters

我们将从第 10 章开始,研究面向批处理的数据流系统(例如 MapReduce),并了解它们如何为我们提供构建大规模数据系统的良好工具和原则。在第 11 章中,我们将采用这些想法并将其应用到数据流中,这使我们能够以更低的延迟做同样的事情。 第 12 章通过探讨未来如何使用这些工具构建可靠、可扩展和可维护的应用程序的想法来总结本书。

We will start in Chapter 10 by examining batch-oriented dataflow systems such as MapReduce, and see how they give us good tools and principles for building large-scale data systems. In Chapter 11 we will take those ideas and apply them to data streams, which allow us to do the same kinds of things with lower delays. Chapter 12 concludes the book by exploring ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future.

第 10 章批处理

Chapter 10. Batch Processing

如果一个系统受一个人的影响太大,它就不可能成功。一旦初始设计完成并且相当稳健,真正的测试就开始了,持不同观点的人们开始自己的实验。

唐纳德·高德纳

A system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments.

Donald Knuth

在本书的前两部分中,我们讨论了很多关于请求查询以及相应的响应结果的内容。许多现代数据系统都采用这种数据处理方式:您提出请求,或者发送指令,一段时间后系统(希望)给您答案。数据库、缓存、搜索索引、Web 服务器和许多其他系统都是以这种方式工作的。

In the first two parts of this book we talked a lot about requests and queries, and the corresponding responses or results. This style of data processing is assumed in many modern data systems: you ask for something, or you send an instruction, and some time later the system (hopefully) gives you an answer. Databases, caches, search indexes, web servers, and many other systems work this way.

在这样的在线系统中,无论是网络浏览器请求页面还是调用远程API的服务,我们通常假设请求是由人类用户触发的,并且用户正在等待响应。他们不应该等待太久,因此我们非常关注这些系统的响应时间(请参阅“描述性能”)。

In such online systems, whether it’s a web browser requesting a page or a service calling a remote API, we generally assume that the request is triggered by a human user, and that the user is waiting for the response. They shouldn’t have to wait too long, so we pay a lot of attention to the response time of these systems (see “Describing Performance”).

Web 以及越来越多的基于 HTTP/REST 的 API 使得请求/响应交互方式变得如此普遍,以至于人们很容易认为这是理所当然的。但我们应该记住,这并不是构建系统的唯一方法,其他方法也有其优点。让我们区分三种不同类型的系统:

The web, and increasing numbers of HTTP/REST-based APIs, has made the request/response style of interaction so common that it’s easy to take it for granted. But we should remember that it’s not the only way of building systems, and that other approaches have their merits too. Let’s distinguish three different types of systems:

服务(在线系统)
Services (online systems)

服务等待来自客户端的请求或指令到达。当收到一个请求时,服务会尝试尽快处理它并发回响应。响应时间通常是衡量服务性能的主要指标,而可用性通常非常重要(如果客户端无法访问服务,用户可能会收到错误消息)。

A service waits for a request or instruction from a client to arrive. When one is received, the service tries to handle it as quickly as possible and sends a response back. Response time is usually the primary measure of performance of a service, and availability is often very important (if the client can’t reach the service, the user will probably get an error message).

批处理系统(离线系统)
Batch processing systems (offline systems)

批处理系统获取大量输入数据,运行作业处理它,并产生一些输出数据。作业通常需要一段时间(从几分钟到几天),因此通常没有用户等待作业完成。相反,批处理作业通常被安排定期运行(例如,每天一次)。批处理作业的主要性能指标通常是吞吐量(处理特定大小的输入数据集所需的时间)。我们在本章中讨论批处理。

A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data. Jobs often take a while (from a few minutes to several days), so there normally isn’t a user waiting for the job to finish. Instead, batch jobs are often scheduled to run periodically (for example, once a day). The primary performance measure of a batch job is usually throughput (the time it takes to crunch through an input dataset of a certain size). We discuss batch processing in this chapter.

流处理系统(近实时系统)
Stream processing systems (near-real-time systems)

流处理介于在线处理和离线/批处理之间(因此有时称为近实时或近线处理)。与批处理系统一样,流处理器消耗输入并产生输出(而不是响应请求)。但是,流作业在事件发生后不久就对其进行操作,而批处理作业则对一组固定的输入数据进行操作。这种差异使得流处理系统比同等的批处理系统具有更低的延迟。由于流处理建立在批处理的基础上,我们将在第 11 章中讨论它。

Stream processing is somewhere between online and offline/batch processing (so it is sometimes called near-real-time or nearline processing). Like a batch processing system, a stream processor consumes inputs and produces outputs (rather than responding to requests). However, a stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data. This difference allows stream processing systems to have lower latency than the equivalent batch systems. As stream processing builds upon batch processing, we discuss it in Chapter 11.

正如我们将在本章中看到的,批处理是我们构建可靠、可扩展和可维护的应用程序的重要组成部分。例如,MapReduce 是一种于 2004 年发布的批处理算法 [ 1 ],曾被(也许过于热情地)称为“使 Google 具有如此大规模可扩展性的算法”[ 2 ]。它随后在各种开源数据系统中得到实现,包括 Hadoop、CouchDB 和 MongoDB。

As we shall see in this chapter, batch processing is an important building block in our quest to build reliable, scalable, and maintainable applications. For example, MapReduce, a batch processing algorithm published in 2004 [1], was (perhaps over-enthusiastically) called “the algorithm that makes Google so massively scalable” [2]. It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB.

与多年前为数据仓库开发的并行处理系统 [ 3 , 4 ] 相比,MapReduce 是一个相当低级的编程模型,但就在商用硬件上可以实现的处理规模而言,它是一个重大进步。尽管 MapReduce 的重要性现在正在下降 [ 5 ],但它仍然值得理解,因为它清楚地说明了批处理为何有用以及如何有用。

MapReduce is a fairly low-level programming model compared to the parallel processing systems that were developed for data warehouses many years previously [3, 4], but it was a major step forward in terms of the scale of processing that could be achieved on commodity hardware. Although the importance of MapReduce is now declining [5], it is still worth understanding, because it provides a clear picture of why and how batch processing is useful.

事实上,批处理是一种非常古老的计算形式。早在可编程数字计算机发明之前,打孔卡制表机(例如 1890 年美国人口普查中使用的 Hollerith 机器 [ 6 ])就实现了一种半机械化的批处理形式,以根据大量输入计算汇总统计数据。MapReduce 与 20 世纪 40 和 50 年代广泛用于商业数据处理的机电式 IBM 卡片分类机有着惊人的相似之处 [ 7 ]。像往常一样,历史有重演的趋势。

In fact, batch processing is a very old form of computing. Long before programmable digital computers were invented, punch card tabulating machines—such as the Hollerith machines used in the 1890 US Census [6]—implemented a semi-mechanized form of batch processing to compute aggregate statistics from large inputs. And MapReduce bears an uncanny resemblance to the electromechanical IBM card-sorting machines that were widely used for business data processing in the 1940s and 1950s [7]. As usual, history has a tendency of repeating itself.

在本章中,我们将了解 MapReduce 和其他几种批处理算法和框架,并探讨它们如何在现代数据系统中使用。但作为开始,我们首先来看如何使用标准 Unix 工具进行数据处理。即使您已经熟悉这些工具,重温一下 Unix 哲学也是值得的,因为 Unix 的思想和经验教训可以延续到大规模、异构的分布式数据系统中。

In this chapter, we will look at MapReduce and several other batch processing algorithms and frameworks, and explore how they are used in modern data systems. But first, to get started, we will look at data processing using standard Unix tools. Even if you are already familiar with them, a reminder about the Unix philosophy is worthwhile because the ideas and lessons from Unix carry over to large-scale, heterogeneous distributed data systems.

使用 Unix 工具进行批处理

Batch Processing with Unix Tools

让我们从一个简单的例子开始。假设您有一个 Web 服务器,每次处理请求时都会在日志文件中附加一行。例如,使用 nginx 默认访问日志格式,日志中的一行可能如下所示:

Let’s start with a simple example. Say you have a web server that appends a line to a log file every time it serves a request. For example, using the nginx default access log format, one line of the log might look like this:

216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] "GET /css/typography.css HTTP/1.1"
200 3377 "http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115
Safari/537.36"
216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] "GET /css/typography.css HTTP/1.1"
200 3377 "http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115
Safari/537.36"

(这实际上是一行;为了便于阅读,这里只是将其分成多行。)该行中有很多信息。为了解释它,需要查看日志格式的定义,如下:

(That is actually one line; it’s only broken onto multiple lines here for readability.) There’s a lot of information in that line. In order to interpret it, you need to look at the definition of the log format, which is as follows:

$remote_addr - $remote_user [$time_local] "$request"
$status $body_bytes_sent "$http_referer" "$http_user_agent"
$remote_addr - $remote_user [$time_local] "$request"
$status $body_bytes_sent "$http_referer" "$http_user_agent"

因此,日志的这一行表明,在 UTC 时间 2015 年 2 月 27 日 17:55:11,服务器收到了来自客户端 IP 地址 216.58.210.78 对文件 /css/typography.css 的请求。用户未经过身份验证,因此 $remote_user 被设置为连字符(-)。响应状态为 200(即请求成功),响应大小为 3,377 字节。Web 浏览器是 Chrome 40,它加载该文件是因为 URL 为 http://martin.kleppmann.com/ 的页面中引用了该文件。

So, this one line of the log indicates that on February 27, 2015, at 17:55:11 UTC, the server received a request for the file /css/typography.css from the client IP address 216.58.210.78. The user was not authenticated, so $remote_user is set to a hyphen (-). The response status was 200 (i.e., the request was successful), and the response was 3,377 bytes in size. The web browser was Chrome 40, and it loaded the file because it was referenced in the page at the URL http://martin.kleppmann.com/.
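As an aside, the field-by-field interpretation above can be made mechanical with a regular expression keyed to the log format definition. The sketch below is illustrative, not from the book: the pattern is an assumption that matches the default nginx format shown, not a production-grade parser.

```python
import re

# Named groups mirror the variables in the nginx log_format definition.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)

line = ('216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] '
        '"GET /css/typography.css HTTP/1.1" 200 3377 '
        '"http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X '
        '10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 '
        'Safari/537.36"')

fields = LOG_PATTERN.match(line).groupdict()
print(fields["status"], fields["body_bytes_sent"])  # → 200 3377
```

Each named group (`remote_addr`, `status`, and so on) then becomes a dictionary key, so downstream analysis can refer to fields by name rather than by position.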

简单的日志分析

Simple Log Analysis

各种工具可以读取这些日志文件并生成关于网站流量的漂亮报告,但为了练习,让我们使用基本的 Unix 工具自己构建一个。例如,假设您想查找网站上最受欢迎的五个页面。您可以在 Unix shell 中这样做:

Various tools can take these log files and produce pretty reports about your website traffic, but for the sake of exercise, let’s build our own, using basic Unix tools. For example, say you want to find the five most popular pages on your website. You can do this in a Unix shell as follows:

cat /var/log/nginx/access.log | 1
  awk '{print $7}' | 2
  sort             | 3
  uniq -c          | 4
  sort -r -n       | 5
  head -n 5          6
1

阅读日志文件。

Read the log file.

2

用空格将每一行分成多个字段,并只输出每行中的第七个这样的字段,这恰好是请求的 URL。在我们的示例行中,此请求 URL 是 /css/typography.css

Split each line into fields by whitespace, and output only the seventh such field from each line, which happens to be the requested URL. In our example line, this request URL is /css/typography.css.

3

用 sort 按字母顺序排列请求 URL 列表。如果某个 URL 被请求了 n 次,那么排序之后,文件中会出现连续重复 n 次的相同 URL。

Alphabetically sort the list of requested URLs. If some URL has been requested n times, then after sorting, the file contains the same URL repeated n times in a row.

4

uniq命令通过检查两个相邻行是否相同来过滤掉输入中的重复行。该-c选项告诉它还输出一个计数器:对于每个不同的 URL,它报告该 URL 在输入中出现的次数。

The uniq command filters out repeated lines in its input by checking whether two adjacent lines are the same. The -c option tells it to also output a counter: for every distinct URL, it reports how many times that URL appeared in the input.

5

第二个 sort 按每行开头的数字(-n)排序,该数字就是 URL 被请求的次数。然后它以相反(-r)的顺序返回结果,即最大的数字排在最前面。

The second sort sorts by the number (-n) at the start of each line, which is the number of times the URL was requested. It then returns the results in reverse (-r) order, i.e. with the largest number first.

6

最后,head仅输出输入的前五行 ( -n 5),并丢弃其余的。

Finally, head outputs just the first five lines (-n 5) of input, and discards the rest.

这一系列命令的输出如下所示:

The output of that series of commands looks something like this:

4189 /favicon.ico
3631 /2013/05/24/improving-security-of-ssh-private-keys.html
2124 /2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
1369 /
 915 /css/typography.css

尽管前面的命令行在您不熟悉 Unix 工具时可能看起来有点晦涩,但它的功能非常强大。它能在几秒钟内处理千兆字节的日志文件,并且您可以轻松修改分析以满足您的需求。例如,如果您想从报告中省略 CSS 文件,请将 awk 参数更改为 '$7 !~ /\.css$/ {print $7}'。如果您想统计热门客户端 IP 地址而不是热门页面,请将 awk 参数更改为 '{print $1}'。等等。

Although the preceding command line likely looks a bit obscure if you’re unfamiliar with Unix tools, it is incredibly powerful. It will process gigabytes of log files in a matter of seconds, and you can easily modify the analysis to suit your needs. For example, if you want to omit CSS files from the report, change the awk argument to '$7 !~ /\.css$/ {print $7}'. If you want to count top client IP addresses instead of top pages, change the awk argument to '{print $1}'. And so on.
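上面提到的两个变体可以在一个小样本日志上试验(下面的文件名、IP 和 URL 纯属虚构,仅作示意):

The two variations mentioned above can be tried out on a small sample log (the filename, IPs, and URLs below are fabricated purely for illustration):

```shell
# A tiny fabricated sample log in nginx-style layout
cat > /tmp/access.log <<'EOF'
1.2.3.4 - - [27/Feb/2015:17:55:11 +0000] "GET /index.html HTTP/1.1" 200 100 "-" "UA"
1.2.3.4 - - [27/Feb/2015:17:55:12 +0000] "GET /css/typography.css HTTP/1.1" 200 100 "-" "UA"
5.6.7.8 - - [27/Feb/2015:17:55:13 +0000] "GET /index.html HTTP/1.1" 200 100 "-" "UA"
EOF

# Top pages, omitting CSS files: the pattern filters out URLs ending in .css
awk '$7 !~ /\.css$/ {print $7}' /tmp/access.log | sort | uniq -c | sort -r -n | head -n 5

# Top client IP addresses instead of top pages: print field 1 instead of field 7
awk '{print $1}' /tmp/access.log | sort | uniq -c | sort -r -n | head -n 5
```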

本书中没有篇幅来详细探讨 Unix 工具,但它们非常值得学习。令人惊讶的是,使用 awk、sed、grep、sort、uniq 和 xargs 的某种组合就能在几分钟内完成许多数据分析,而且它们的性能出奇地好 [8]。

We don’t have space in this book to explore Unix tools in detail, but they are very much worth learning about. Surprisingly many data analyses can be done in a few minutes using some combination of awk, sed, grep, sort, uniq, and xargs, and they perform surprisingly well [8].

命令链与自定义程序

Chain of commands versus custom program

您可以编写一个简单的程序来完成同样的事情,而不是使用 Unix 命令链。例如,在 Ruby 中,它可能看起来像这样:

Instead of the chain of Unix commands, you could write a simple program to do the same thing. For example, in Ruby, it might look something like this:

counts = Hash.new(0) 1

File.open('/var/log/nginx/access.log') do |file|
  file.each do |line|
    url = line.split[6] 2
    counts[url] += 1 3
  end
end

top5 = counts.map{|url, count| [count, url] }.sort.reverse[0...5] 4
top5.each{|count, url| puts "#{count} #{url}" } 5
1

counts是一个哈希表,用于记录我们看到每个 URL 的次数。计数器默认为零。

counts is a hash table that keeps a counter for the number of times we’ve seen each URL. A counter is zero by default.

2

从日志的每一行中,我们将 URL 作为第七个空格分隔的字段(这里的数组索引是 6,因为 Ruby 的数组是零索引的)。

From each line of the log, we take the URL to be the seventh whitespace-separated field (the array index here is 6 because Ruby’s arrays are zero-indexed).

3

增加日志当前行中 URL 的计数器。

Increment the counter for the URL in the current line of the log.

4

按计数器值(降序)对哈希表内容进行排序,并取出前 5 个条目。

Sort the hash table contents by counter value (descending), and take the top five entries.

5

打印出前五个条目。

Print out those top five entries.

这个程序不像 Unix 管道链那么简洁,但它具有相当的可读性,您更喜欢这两个程序中的哪一个在一定程度上取决于您的品味。然而,除了两者之间表面上的语法差异之外,执行流程也存在很大差异,如果您在大文件上运行此分析,这一点就会变得很明显。

This program is not as concise as the chain of Unix pipes, but it’s fairly readable, and which of the two you prefer is partly a matter of taste. However, besides the superficial syntactic differences between the two, there is a big difference in the execution flow, which becomes apparent if you run this analysis on a large file.

排序与内存聚合

Sorting versus in-memory aggregation

Ruby 脚本在内存中保存一个 URL 哈希表,其中每个 URL 都映射到它被看到的次数。Unix 管道示例没有这样的哈希表,而是依赖于对 URL 列表进行排序,其中同一 URL 的多次出现只是重复。

The Ruby script keeps an in-memory hash table of URLs, where each URL is mapped to the number of times it has been seen. The Unix pipeline example does not have such a hash table, but instead relies on sorting a list of URLs in which multiple occurrences of the same URL are simply repeated.

哪种方法更好?这取决于您有多少个不同的 URL。对于大多数中小型网站,您可能可以在(例如)1 GB 内存中容纳所有不同的 URL 以及每个 URL 的计数器。在此示例中,作业的工作集(作业需要随机访问的内存量)仅取决于不同 URL 的数量:如果单个 URL 有一百万个日志条目,则哈希中所需的空间table 仍然只是一个 URL 加上计数器的大小。如果这个工作集足够小,内存中的哈希表就可以正常工作——即使在笔记本电脑上也是如此。

Which approach is better? It depends how many different URLs you have. For most small to mid-sized websites, you can probably fit all distinct URLs, and a counter for each URL, in (say) 1 GB of memory. In this example, the working set of the job (the amount of memory to which the job needs random access) depends only on the number of distinct URLs: if there are a million log entries for a single URL, the space required in the hash table is still just one URL plus the size of the counter. If this working set is small enough, an in-memory hash table works fine—even on a laptop.

另一方面,如果作业的工作集大于可用内存,则排序方法的优点是可以有效地利用磁盘。这与我们在“SSTables 和 LSM-Trees”中讨论的原理相同 :数据块可以在内存中排序并作为段文件写入磁盘,然后可以将多个排序的段合并为一个更大的排序文件。合并排序具有在磁盘上表现良好的顺序访问模式。(请记住,优化顺序 I/O 是第 3 章中反复出现的主题。这里再次出现了相同的模式。)

On the other hand, if the job’s working set is larger than the available memory, the sorting approach has the advantage that it can make efficient use of disks. It’s the same principle as we discussed in “SSTables and LSM-Trees”: chunks of data can be sorted in memory and written out to disk as segment files, and then multiple sorted segments can be merged into a larger sorted file. Mergesort has sequential access patterns that perform well on disks. (Remember that optimizing for sequential I/O was a recurring theme in Chapter 3. The same pattern reappears here.)

GNU Coreutils (Linux) 中的实用程序sort通过溢出到磁盘自动处理大于内存的数据集,并自动跨多个 CPU 核心并行排序 [ 9 ]。这意味着我们之前看到的简单的 Unix 命令链可以轻松扩展到大型数据集,而不会耗尽内存。瓶颈可能是从磁盘读取输入文件的速率。

The sort utility in GNU Coreutils (Linux) automatically handles larger-than-memory datasets by spilling to disk, and automatically parallelizes sorting across multiple CPU cores [9]. This means that the simple chain of Unix commands we saw earlier easily scales to large datasets, without running out of memory. The bottleneck is likely to be the rate at which the input file can be read from disk.
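GNU coreutils 的 sort 也允许显式调节这种溢出和并行行为;下面的示意使用真实存在的 GNU 选项(-S 缓冲区大小、-T 临时目录、--parallel 并行度),但具体数值只是示例:

GNU coreutils sort also lets you tune this spilling and parallelism explicitly; the sketch below uses real GNU options (-S buffer size, -T temp directory, --parallel), though the specific values are only examples:

```shell
# A fabricated input file standing in for a large URL list
printf 'b\na\nc\na\n' > /tmp/urls.txt

# GNU coreutils sort options (values here are only examples):
#   -S 100M        cap the in-memory sort buffer; larger inputs spill sorted
#                  runs to disk and are merged afterwards
#   -T /tmp        directory to use for the temporary spill files
#   --parallel=4   use up to four CPU cores for sorting
sort -S 100M -T /tmp --parallel=4 /tmp/urls.txt | uniq -c
```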

Unix 哲学

The Unix Philosophy

我们能够使用前面示例中的一系列命令轻松分析日志文件并非巧合:这实际上是 Unix 的关键设计思想之一,并且在今天仍然具有惊人的相关性。让我们更深入地研究它,以便我们可以借鉴 Unix [ 10 ] 的一些想法。

It’s no coincidence that we were able to analyze a log file quite easily, using a chain of commands like in the previous example: this was in fact one of the key design ideas of Unix, and it remains astonishingly relevant today. Let’s look at it in some more depth so that we can borrow some ideas from Unix [10].

Unix 管道的发明者道格·麦克罗伊(Doug McIlroy)在 1964 年首次这样描述它们 [11]:“我们应该有一些像花园软管那样连接程序的方法;当需要以另一种方式处理数据时,再拧上另一节。I/O 也应如此。”管道的类比流传了下来,用管道连接程序的想法成为现在所谓的 Unix 哲学的一部分,即一套在 Unix 开发者和用户中流行的设计原则。1978 年,这一哲学被描述如下 [12, 13]:

Doug McIlroy, the inventor of Unix pipes, first described them like this in 1964 [11]: “We should have some ways of connecting programs like [a] garden hose—screw in another segment when it becomes necessary to massage data in another way. This is the way of I/O also.” The plumbing analogy stuck, and the idea of connecting programs with pipes became part of what is now known as the Unix philosophy—a set of design principles that became popular among the developers and users of Unix. The philosophy was described in 1978 as follows [12, 13]:

  1. 让每个程序做好一件事。要完成一项新工作,请重新构建,而不是通过添加新“功能”来使旧程序复杂化。

  2. 期望每个程序的输出都成为另一个未知程序的输入。不要用无关的信息来扰乱输出。避免严格的柱状或二进制输入格式。不要坚持交互式输入。

  3. 设计和构建软件,甚至操作系统,尽早试用,最好在几周内试用。毫不犹豫地扔掉笨拙的部件并重建它们。

  4. 优先使用工具而不是不熟练的帮助来减轻编程任务,即使您必须绕道构建工具并期望在使用完它们后扔掉其中一些工具。

  1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features”.

  2. Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don’t insist on interactive input.

  3. Design and build software, even operating systems, to be tried early, ideally within weeks. Don’t hesitate to throw away the clumsy parts and rebuild them.

  4. Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you’ve finished using them.

这种方法——自动化、快速原型设计、增量迭代、对实验友好以及将大型项目分解为可管理的块——听起来非常像当今的敏捷和 DevOps 运动。令人惊讶的是,四十年来几乎没有什么变化。

This approach—automation, rapid prototyping, incremental iteration, being friendly to experimentation, and breaking down large projects into manageable chunks—sounds remarkably like the Agile and DevOps movements of today. Surprisingly little has changed in four decades.

sort 工具是程序做好一件事的绝佳例子。可以说,它的排序实现比大多数编程语言标准库中的更好(标准库的实现不会溢出到磁盘,也不会使用多线程,即使这样做会有好处)。然而,sort 单独使用时几乎没有什么用处。它只有与 uniq 等其他 Unix 工具结合使用时才变得强大。

The sort tool is a great example of a program that does one thing well. It is arguably a better sorting implementation than most programming languages have in their standard libraries (which do not spill to disk and do not use multiple threads, even when that would be beneficial). And yet, sort is barely useful in isolation. It only becomes powerful in combination with the other Unix tools, such as uniq.

像 Unix shell 这样的工具bash让我们可以轻松地将这些小程序组合成非常强大的数据处理作业。尽管其中许多程序是由不同群体编写的,但它们可以通过灵活的方式连接在一起。Unix 做了什么来实现这种可组合性?

A Unix shell like bash lets us easily compose these small programs into surprisingly powerful data processing jobs. Even though many of these programs are written by different groups of people, they can be joined together in flexible ways. What does Unix do to enable this composability?

统一的界面

A uniform interface

如果您希望一个程序的输出成为另一个程序的输入,则意味着这些程序必须使用相同的数据格式,换句话说,即兼容的接口。如果您希望能够将任何程序的输出连接到任何程序的输入,这意味着所有程序必须使用相同的输入/输出接口。

If you expect the output of one program to become the input to another program, that means those programs must use the same data format—in other words, a compatible interface. If you want to be able to connect any program’s output to any program’s input, that means that all programs must use the same input/output interface.

在 Unix 中,该接口是一个文件(或者更准确地说,是一个文件描述符)。文件只是一个有序的字节序列。因为这是一个如此简单的接口,所以许多不同的东西都可以用同一个接口来表示:文件系统上的实际文件、到另一个进程的通信通道(Unix 套接字、stdin、stdout)、设备驱动程序(例如 /dev/audio 或 /dev/lp0)、代表 TCP 连接的套接字,等等。人们很容易认为这是理所当然的,但这些截然不同的东西能共享一个统一的接口、从而轻松地拼接在一起,实际上非常了不起。ii

In Unix, that interface is a file (or, more precisely, a file descriptor). A file is just an ordered sequence of bytes. Because that is such a simple interface, many different things can be represented using the same interface: an actual file on the filesystem, a communication channel to another process (Unix socket, stdin, stdout), a device driver (say /dev/audio or /dev/lp0), a socket representing a TCP connection, and so on. It’s easy to take this for granted, but it’s actually quite remarkable that these very different things can share a uniform interface, so they can easily be plugged together.ii

按照惯例,许多(但不是全部)Unix 程序将这一字节序列视为 ASCII 文本。我们的日志分析示例利用了这一事实:awk、sort、uniq 和 head 都将其输入文件视为由 \n(换行符,ASCII 0x0A)字符分隔的记录列表。选择 \n 是任意的;可以说,ASCII 记录分隔符 0x1E 本来是更好的选择,因为它正是为此目的而设计的 [14]。但无论如何,所有这些程序都标准化使用同一个记录分隔符,这一事实使它们能够互操作。

By convention, many (but not all) Unix programs treat this sequence of bytes as ASCII text. Our log analysis example used this fact: awk, sort, uniq, and head all treat their input file as a list of records separated by the \n (newline, ASCII 0x0A) character. The choice of \n is arbitrary—arguably, the ASCII record separator 0x1E would have been a better choice, since it’s intended for this purpose [14]—but in any case, the fact that all these programs have standardized on using the same record separator allows them to interoperate.

每条记录(即一行输入)的解析则更加含糊。Unix 工具通常通过空格或制表符将一行拆分为多个字段,但也会使用 CSV(逗号分隔)、管道符分隔和其他编码。即使像 xargs 这样相当简单的工具,也有六个命令行选项用于指定如何解析其输入。

The parsing of each record (i.e., a line of input) is more vague. Unix tools commonly split a line into fields by whitespace or tab characters, but CSV (comma-separated), pipe-separated, and other encodings are also used. Even a fairly simple tool like xargs has half a dozen command-line options for specifying how its input should be parsed.
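绕过空白字符歧义的一种常见做法是使用 NUL 分隔的输入;下面的示意使用 find 的 -print0 和 xargs 的 -0 选项(目录名纯属示例):

One common way around the whitespace ambiguity is NUL-delimited input; the sketch below uses find's -print0 together with xargs's -0 option (the directory name is just an example):

```shell
# Filenames containing spaces break naive whitespace splitting, so find and
# xargs can agree on a NUL byte as the record separator instead: -print0
# terminates each name with \0, and -0 tells xargs to split on \0.
rm -rf /tmp/xargs-demo && mkdir -p /tmp/xargs-demo
touch '/tmp/xargs-demo/a file.log' /tmp/xargs-demo/b.log

find /tmp/xargs-demo -name '*.log' -print0 | xargs -0 -n 1 echo found
```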

ASCII 文本这种统一接口大体上可行,但并不完全优雅:我们的日志分析示例使用 {print $7} 来提取 URL,可读性不太好。在理想的世界里,它也许可以写成 {print $request_url} 之类的形式。稍后我们会回到这个想法。

The uniform interface of ASCII text mostly works, but it’s not exactly beautiful: our log analysis example used {print $7} to extract the URL, which is not very readable. In an ideal world this could have perhaps been {print $request_url} or something of that sort. We will return to this idea later.

尽管它并不完美,但即使在几十年后,Unix 的统一接口仍然是了不起的。没有多少软件可以像 Unix 工具那样进行互操作和编写:您无法通过自定义分析工具轻松地将电子邮件帐户的内容和在线购物历史记录传输到电子表格中,并将结果发布到社交网络或维基百科。今天,程序像 Unix 工具一样顺利地协同工作只是一个例外,而不是常态。

Although it’s not perfect, even decades later, the uniform interface of Unix is still something remarkable. Not many pieces of software interoperate and compose as well as Unix tools do: you can’t easily pipe the contents of your email account and your online shopping history through a custom analysis tool into a spreadsheet and post the results to a social network or a wiki. Today it’s an exception, not the norm, to have programs that work together as smoothly as Unix tools do.

即使具有相同数据模型的数据库通常也无法轻松地将数据从一个数据库中取出并存入另一个数据库中。这种缺乏整合导致数据的巴尔干化。

Even databases with the same data model often don’t make it easy to get data out of one and into the other. This lack of integration leads to Balkanization of data.

逻辑和接线分离

Separation of logic and wiring

Unix 工具的另一个特征是它们使用标准输入(stdin)和标准输出(stdout)。如果您运行一个程序并且没有指定任何其他内容,stdin 来自键盘,stdout 输出到屏幕。但是,您也可以从文件获取输入,和/或将输出重定向到文件。管道允许您将一个进程的 stdout 接到另一个进程的 stdin 上(使用一个小的内存缓冲区,而无需将整个中间数据流写入磁盘)。

Another characteristic feature of Unix tools is their use of standard input (stdin) and standard output (stdout). If you run a program and don’t specify anything else, stdin comes from the keyboard and stdout goes to the screen. However, you can also take input from a file and/or redirect output to a file. Pipes let you attach the stdout of one process to the stdin of another process (with a small in-memory buffer, and without writing the entire intermediate data stream to disk).

如果需要,程序仍然可以直接读取和写入文件,但如果程序不关心特定的文件路径、只使用 stdin 和 stdout,Unix 方法的效果最好。这允许 shell 用户以任何他们想要的方式连接输入和输出;程序不知道也不关心输入来自哪里、输出去向何处。(可以说这是一种松散耦合、后期绑定 [15] 或控制反转 [16] 的形式。)将输入/输出的接线与程序逻辑分离,使得将小工具组合成更大的系统变得更容易。

A program can still read and write files directly if it needs to, but the Unix approach works best if a program doesn’t worry about particular file paths and simply uses stdin and stdout. This allows a shell user to wire up the input and output in whatever way they want; the program doesn’t know or care where the input is coming from and where the output is going to. (One could say this is a form of loose coupling, late binding [15], or inversion of control [16].) Separating the input/output wiring from the program logic makes it easier to compose small tools into bigger systems.

您甚至可以编写自己的程序并将其与操作系统提供的工具结合起来。您的程序只需要从 读取输入stdin并将输出写入到stdout,并且它可以参与数据处理管道。在日志分析示例中,您可以编写一个将用户代理字符串转换为更明智的浏览器标识符的工具,或者将 IP 地址转换为国家/地区代码的工具,然后将其简单地插入管道中。该sort程序并不关心它是与操作系统的其他部分还是与您编写的程序进行通信。

You can even write your own programs and combine them with the tools provided by the operating system. Your program just needs to read input from stdin and write output to stdout, and it can participate in data processing pipelines. In the log analysis example, you could write a tool that translates user-agent strings into more sensible browser identifiers, or a tool that translates IP addresses into country codes, and simply plug it into the pipeline. The sort program doesn’t care whether it’s communicating with another part of the operating system or with a program written by you.
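这样的过滤器可以像下面这样示意:一个只读 stdin、写 stdout 的小程序,用一张虚构的查找表把客户端 IP 映射为国家代码,然后原样接入管道:

Such a filter can be sketched as follows: a tiny stdin-to-stdout program that maps client IPs to country codes via a fabricated lookup table, and then slots straight into the pipeline:

```shell
# A hypothetical stdin-to-stdout filter: it maps the client IP (field 1) to a
# country code via a fabricated lookup table. Any program that reads stdin
# and writes stdout would slot into the pipeline the same way.
cat > /tmp/ip2country.awk <<'EOF'
BEGIN { cc["1.2.3.4"] = "US"; cc["5.6.7.8"] = "DE" }   # fabricated mapping
{ if ($1 in cc) print cc[$1]; else print "??" }
EOF

# Count requests per country instead of per URL (sample lines are made up)
printf '1.2.3.4 - - req\n5.6.7.8 - - req\n1.2.3.4 - - req\n' |
  awk -f /tmp/ip2country.awk |
  sort | uniq -c | sort -r -n
```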

但是,您能用 stdin 和 stdout 做的事情是有限的。需要多个输入或输出的程序是可能的,但很棘手。您无法将程序的输出通过管道传输到网络连接 [17, 18]。iii 如果一个程序直接打开文件进行读写,或者将另一个程序作为子进程启动,或者打开网络连接,那么该 I/O 就是由程序自己接线的。它仍然可以配置(例如通过命令行选项),但在 shell 中连接输入和输出的灵活性就降低了。

However, there are limits to what you can do with stdin and stdout. Programs that need multiple inputs or outputs are possible but tricky. You can’t pipe a program’s output into a network connection [17, 18].iii If a program directly opens files for reading and writing, or starts another program as a subprocess, or opens a network connection, then that I/O is wired up by the program itself. It can still be configurable (through command-line options, for example), but the flexibility of wiring up inputs and outputs in a shell is reduced.

透明度和实验

Transparency and experimentation

Unix 工具如此成功的部分原因是它们让我们很容易看到正在发生的事情:

Part of what makes Unix tools so successful is that they make it quite easy to see what is going on:

  • Unix 命令的输入文件通常被视为不可变的。这意味着您可以根据需要多次运行命令,尝试各种命令行选项,而不会损坏输入文件。

  • The input files to Unix commands are normally treated as immutable. This means you can run the commands as often as you want, trying various command-line options, without damaging the input files.

  • 您可以随时结束管道,将输出通过管道传输到less,然后查看它是否具有预期的形式。这种检查能力对于调试非常有用。

  • You can end the pipeline at any point, pipe the output into less, and look at it to see if it has the expected form. This ability to inspect is great for debugging.

  • 您可以将一个管道阶段的输出写入文件,并使用该文件作为下一阶段的输入。这允许您重新启动后期阶段,而无需重新运行整个管道。

  • You can write the output of one pipeline stage to a file and use that file as input to the next stage. This allows you to restart the later stage without rerunning the entire pipeline.
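第三点可以这样示意:把开销较大的排序结果落盘一次,之后就可以反复调整后面的阶段而无需重跑排序(输入数据纯属虚构):

The third point can be sketched like this: materialize the output of the expensive sort once, after which the later stages can be rerun and tweaked without redoing the sort (the input data is fabricated):

```shell
# Fabricated data standing in for the output of awk '{print $7}'
printf '/a\n/b\n/a\n/a\n/b\n/c\n' > /tmp/urls.raw

# Stage 1: run the expensive sort once and keep its output on disk
sort /tmp/urls.raw > /tmp/urls.sorted

# Later stages can now be rerun and tweaked cheaply against the saved file,
# without repeating the sort
uniq -c < /tmp/urls.sorted | sort -r -n | head -n 2
```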

因此,尽管 Unix 工具与关系数据库的查询优化器相比是相当生硬、简单的工具,但它们仍然非常有用,尤其是对于实验而言。

Thus, even though Unix tools are quite blunt, simple tools compared to a query optimizer of a relational database, they remain amazingly useful, especially for experimentation.

然而,Unix 工具的最大限制是它们只能在一台机器上运行,这就是 Hadoop 等工具的用武之地。

However, the biggest limitation of Unix tools is that they run only on a single machine—and that’s where tools like Hadoop come in.

MapReduce 和分布式文件系统

MapReduce and Distributed Filesystems

MapReduce 有点像 Unix 工具,但分布在潜在的数千台机器上。与 Unix 工具一样,它是一个相当生硬、暴力但出奇有效的工具。单个 MapReduce 作业与单个 Unix 进程相当:它接受一个或多个输入并产生一个或多个输出。

MapReduce is a bit like Unix tools, but distributed across potentially thousands of machines. Like Unix tools, it is a fairly blunt, brute-force, but surprisingly effective tool. A single MapReduce job is comparable to a single Unix process: it takes one or more inputs and produces one or more outputs.

与大多数 Unix 工具一样,运行 MapReduce 作业通常不会修改输入,并且除了生成输出之外不会产生任何副作用。输出文件以顺序方式写入一次(写入后不会修改文件的任何现有部分)。

As with most Unix tools, running a MapReduce job normally does not modify the input and does not have any side effects other than producing the output. The output files are written once, in a sequential fashion (not modifying any existing part of a file once it has been written).

Unix 工具使用stdinstdout作为输入和输出,而 MapReduce 作业则在分布式文件系统上读取和写入文件。在 Hadoop 的 MapReduce 实现中,该文件系统称为 HDFS(Hadoop 分布式文件系统),它是 Google 文件系统 (GFS) [ 19 ] 的开源重新实现。

While Unix tools use stdin and stdout as input and output, MapReduce jobs read and write files on a distributed filesystem. In Hadoop’s implementation of MapReduce, that filesystem is called HDFS (Hadoop Distributed File System), an open source reimplementation of the Google File System (GFS) [19].

除了 HDFS 之外,还存在各种其他分布式文件系统,例如 GlusterFS 和 Quantcast 文件系统(QFS)[ 20 ]。Amazon S3、Azure Blob Storage 和 OpenStack Swift [ 21 ] 等对象存储服务在很多方面都很相似。iv 在本章中,我们将主要使用 HDFS 作为运行示例,但原则适用于任何分布式文件系统。

Various other distributed filesystems besides HDFS exist, such as GlusterFS and the Quantcast File System (QFS) [20]. Object storage services such as Amazon S3, Azure Blob Storage, and OpenStack Swift [21] are similar in many ways.iv In this chapter we will mostly use HDFS as a running example, but the principles apply to any distributed filesystem.

HDFS 基于无共享原则(请参阅第 II 部分的介绍),与网络附加存储(NAS) 和存储区域网络 (SAN) 架构的共享磁盘方法形成对比。共享磁盘存储由集中式存储设备实现,通常使用定制硬件和特殊网络基础设施(例如光纤通道)。另一方面,无共享方法不需要特殊的硬件,只需要通过传统数据中心网络连接的计算机。

HDFS is based on the shared-nothing principle (see the introduction to Part II), in contrast to the shared-disk approach of Network Attached Storage (NAS) and Storage Area Network (SAN) architectures. Shared-disk storage is implemented by a centralized storage appliance, often using custom hardware and special network infrastructure such as Fibre Channel. On the other hand, the shared-nothing approach requires no special hardware, only computers connected by a conventional datacenter network.

HDFS 由在每台计算机上运行的守护进程组成,公开一个网络服务,允许其他节点访问存储在该计算机上的文件(假设数据中心中的每台通用计算机都附加了一些磁盘)。称为NameNode的中央服务器会跟踪哪些文件块存储在哪台机器上。因此,HDFS 从概念上创建了一个大文件系统,可以使用运行守护程序的所有计算机的磁盘空间。

HDFS consists of a daemon process running on each machine, exposing a network service that allows other nodes to access files stored on that machine (assuming that every general-purpose machine in a datacenter has some disks attached to it). A central server called the NameNode keeps track of which file blocks are stored on which machine. Thus, HDFS conceptually creates one big filesystem that can use the space on the disks of all machines running the daemon.

为了容忍机器和磁盘故障,文件块被复制到多台机器上。复制可能意味着在多台机器上简单地复制相同数据的多个副本,如 第 5 章所述,或者是诸如里德-所罗门码之类的纠删码方案,它允许以比完全复制更低的存储开销来恢复丢失的数据 [ 20 , 22 ] 。这些技术类似于 RAID,它为连接到同一台计算机的多个磁盘提供冗余;不同之处在于,在分布式文件系统中,文件访问和复制是通过传统的数据中心网络完成的,无需特殊硬件。

In order to tolerate machine and disk failures, file blocks are replicated on multiple machines. Replication may mean simply several copies of the same data on multiple machines, as in Chapter 5, or an erasure coding scheme such as Reed–Solomon codes, which allows lost data to be recovered with lower storage overhead than full replication [20, 22]. The techniques are similar to RAID, which provides redundancy across several disks attached to the same machine; the difference is that in a distributed filesystem, file access and replication are done over a conventional datacenter network without special hardware.

HDFS 具有良好的可扩展性:在撰写本文时,最大的 HDFS 部署在数万台机器上运行,总存储容量为数百 PB [ 23 ]。如此大规模已经变得可行,因为使用商用硬件和开源软件在 HDFS 上存储和访问数据的成本远低于专用存储设备上同等容量的成本[24 ]

HDFS has scaled well: at the time of writing, the biggest HDFS deployments run on tens of thousands of machines, with combined storage capacity of hundreds of petabytes [23]. Such large scale has become viable because the cost of data storage and access on HDFS, using commodity hardware and open source software, is much lower than that of the equivalent capacity on a dedicated storage appliance [24].

MapReduce 作业执行

MapReduce Job Execution

MapReduce 是一个编程框架,您可以使用它编写代码来处理 HDFS 等分布式文件系统中的大型数据集。最简单的理解方法是参考《简单日志分析》中的Web服务器日志分析示例。MapReduce 中的数据处理模式与此示例非常相似:

MapReduce is a programming framework with which you can write code to process large datasets in a distributed filesystem like HDFS. The easiest way of understanding it is by referring back to the web server log analysis example in “Simple Log Analysis”. The pattern of data processing in MapReduce is very similar to this example:

  1. 读取一组输入文件,并将其分解为记录。在 Web 服务器日志示例中,每条记录都是日志中的一行(即以 \n 作为记录分隔符)。

  1. Read a set of input files, and break it up into records. In the web server log example, each record is one line in the log (that is, \n is the record separator).

  2. 调用映射器(mapper)函数从每个输入记录中提取键和值。在前面的示例中,映射器函数是 awk '{print $7}':它提取 URL($7)作为键,并将值留空。

  2. Call the mapper function to extract a key and value from each input record. In the preceding example, the mapper function is awk '{print $7}': it extracts the URL ($7) as the key, and leaves the value empty.

  3. 按键对所有键值对进行排序。在日志示例中,这是由第一个 sort 命令完成的。

  3. Sort all of the key-value pairs by key. In the log example, this is done by the first sort command.

  4. 调用归约器(reducer)函数来迭代排序后的键值对。如果同一个键出现多次,排序会使它们在列表中相邻,因此很容易合并这些值,而无需在内存中保留大量状态。在前面的例子中,reducer 是通过 uniq -c 命令实现的,它统计具有相同键的相邻记录的数量。

  4. Call the reducer function to iterate over the sorted key-value pairs. If there are multiple occurrences of the same key, the sorting has made them adjacent in the list, so it is easy to combine those values without having to keep a lot of state in memory. In the preceding example, the reducer is implemented by the command uniq -c, which counts the number of adjacent records with the same key.

这四个步骤可以由一个 MapReduce 作业执行。步骤 2(map)和 4(reduce)是您编写自定义数据处理代码的地方。第 1 步(将文件分解为记录)由输入格式解析器处理。第 3 步,即sort步骤,在 MapReduce 中是隐式的 — 您不必编写它,因为映射器的输出在提供给缩减器之前始终已排序。

Those four steps can be performed by one MapReduce job. Steps 2 (map) and 4 (reduce) are where you write your custom data processing code. Step 1 (breaking files into records) is handled by the input format parser. Step 3, the sort step, is implicit in MapReduce—you don’t have to write it, because the output from the mapper is always sorted before it is given to the reducer.

要创建 MapReduce 作业,您需要实现两个回调函数,即映射器和化简器,其行为如下(另请参阅“MapReduce 查询”):

To create a MapReduce job, you need to implement two callback functions, the mapper and reducer, which behave as follows (see also “MapReduce Querying”):

映射器
Mapper

对于每个输入记录都会调用一次映射器,其工作是从输入记录中提取键和值。对于每个输入,它可能生成任意数量的键值对(包括无)。它不保留从一个输入记录到下一个输入记录的任何状态,因此每个记录都是独立处理的。

The mapper is called once for every input record, and its job is to extract the key and value from the input record. For each input, it may generate any number of key-value pairs (including none). It does not keep any state from one input record to the next, so each record is handled independently.

减速器
Reducer

MapReduce 框架获取映射器生成的键值对,收集属于同一键的所有值,并使用该值集合的迭代器调用缩减器。该reducer可以产生输出记录(例如相同URL出现的次数)。

The MapReduce framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer with an iterator over that collection of values. The reducer can produce output records (such as the number of occurrences of the same URL).

在 Web 服务器日志示例中,我们在第 5 步中有第二个 sort 命令,它按请求数量对 URL 进行排名。在 MapReduce 中,如果需要第二个排序阶段,可以通过编写第二个 MapReduce 作业、并将第一个作业的输出用作第二个作业的输入来实现。这样来看,mapper 的作用是准备数据,将其放入适合排序的形式;而 reducer 的作用是处理已排序的数据。

In the web server log example, we had a second sort command in step 5, which ranked URLs by number of requests. In MapReduce, if you need a second sorting stage, you can implement it by writing a second MapReduce job and using the output of the first job as input to the second job. Viewed like this, the role of the mapper is to prepare the data by putting it into a form that is suitable for sorting, and the role of the reducer is to process the data that has been sorted.

MapReduce的分布式执行

Distributed execution of MapReduce

与 Unix 命令管道的主要区别在于 MapReduce 可以跨多台机器并行计算,而无需编写代码来显式处理并行性。Mapper和Reducer一次只操作一条记录;他们不需要知道输入来自哪里或输出要去哪里,因此该框架可以处理机器之间移动数据的复杂性。

The main difference from pipelines of Unix commands is that MapReduce can parallelize a computation across many machines, without you having to write code to explicitly handle the parallelism. The mapper and reducer only operate on one record at a time; they don’t need to know where their input is coming from or their output is going to, so the framework can handle the complexities of moving data between machines.

可以在分布式计算中使用标准 Unix 工具作为映射器和化简器 [ 25 ],但更常见的是它们被实现为传统编程语言中的函数。在 Hadoop MapReduce 中,映射器和缩减器都是实现特定接口的 Java 类。在 MongoDB 和 CouchDB 中,映射器和化简器是 JavaScript 函数(请参阅“MapReduce 查询”)。

It is possible to use standard Unix tools as mappers and reducers in a distributed computation [25], but more commonly they are implemented as functions in a conventional programming language. In Hadoop MapReduce, the mapper and reducer are each a Java class that implements a particular interface. In MongoDB and CouchDB, mappers and reducers are JavaScript functions (see “MapReduce Querying”).
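这个想法(在 Hadoop Streaming 中是真实存在的)可以在本地模拟:由 sort 扮演框架的 shuffle,mapper 和 reducer 只是普通的 stdin/stdout 程序。下面只是一个本地示意,并非真实的 Hadoop 调用:

That idea (real in Hadoop Streaming) can be simulated locally: sort plays the role of the framework's shuffle, and the mapper and reducer are ordinary stdin/stdout programs. The following is only a local sketch, not an actual Hadoop invocation:

```shell
# A local simulation of the MapReduce dataflow: the mapper emits
# key<TAB>value lines, sort plays the framework's shuffle, and the reducer
# sums the values of adjacent equal keys. In actual Hadoop Streaming these
# scripts would be passed via -mapper/-reducer; here a pipe wires them up.
cat > /tmp/mapper.sh <<'EOF'
#!/bin/sh
awk '{print $7 "\t1"}'   # key = requested URL (field 7), value = 1
EOF
cat > /tmp/reducer.sh <<'EOF'
#!/bin/sh
awk -F'\t' '{ if ($1 != k) { if (k != "") print s, k; k = $1; s = 0 } s += $2 }
            END { if (k != "") print s, k }'
EOF
chmod +x /tmp/mapper.sh /tmp/reducer.sh

# Three fabricated log lines; field 7 is the URL, as in the earlier example
printf 'a - - [t +0] "GET /x H" 200\nb - - [t +0] "GET /y H" 200\nc - - [t +0] "GET /x H" 200\n' |
  /tmp/mapper.sh | sort | /tmp/reducer.sh
```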

图 10-1显示了 Hadoop MapReduce 作业中的数据流。它的并行化基于分区(参见第 6 章):作业的输入通常是 HDFS 中的目录,输入目录中的每个文件或文件块都被视为一个单独的分区,可以由单独的映射任务处理(在 图10-1中用m 1m 2m 3标记)。

Figure 10-1 shows the dataflow in a Hadoop MapReduce job. Its parallelization is based on partitioning (see Chapter 6): the input to a job is typically a directory in HDFS, and each file or file block within the input directory is considered to be a separate partition that can be processed by a separate map task (marked by m 1, m 2, and m 3 in Figure 10-1).

每个输入文件的大小通常为数百兆字节。MapReduce 调度程序(图中未显示)尝试在存储输入文件副本的一台机器上运行每个映射器,前提是该机器有足够的备用 RAM 和 CPU 资源来运行映射任务 [26 ]。这一原则被称为将计算放在数据附近 [ 27 ]:它节省了通过网络复制输入文件的时间,减少了网络负载并增加了局部性。

Each input file is typically hundreds of megabytes in size. The MapReduce scheduler (not shown in the diagram) tries to run each mapper on one of the machines that stores a replica of the input file, provided that machine has enough spare RAM and CPU resources to run the map task [26]. This principle is known as putting the computation near the data [27]: it saves copying the input file over the network, reducing network load and increasing locality.

图 10-1。具有三个映射器和三个缩减器的 MapReduce 作业。

Figure 10-1. A MapReduce job with three mappers and three reducers.

在大多数情况下,应在映射任务中运行的应用程序代码尚未出现在分配了运行该任务的机器上,因此 MapReduce 框架首先复制代码(例如,对于 Java 程序而言是 JAR 文件) )到适当的机器。然后,它启动映射任务并开始读取输入文件,一次将一个记录传递给映射器回调。映射器的输出由键值对组成。

In most cases, the application code that should run in the map task is not yet present on the machine that is assigned the task of running it, so the MapReduce framework first copies the code (e.g., JAR files in the case of a Java program) to the appropriate machines. It then starts the map task and begins reading the input file, passing one record at a time to the mapper callback. The output of the mapper consists of key-value pairs.

计算的归约端也被分区。虽然 map 任务的数量由输入文件块的数量决定,但 reduce 任务的数量是由作业作者配置的(可以与 map 任务的数量不同)。为了确保具有相同键的所有键值对最终落到同一个 reducer 上,框架使用键的哈希来确定哪个 reduce 任务应该接收特定的键值对(请参阅“按键的哈希分区”)。

The reduce side of the computation is also partitioned. While the number of map tasks is determined by the number of input file blocks, the number of reduce tasks is configured by the job author (it can be different from the number of map tasks). To ensure that all key-value pairs with the same key end up at the same reducer, the framework uses a hash of the key to determine which reduce task should receive a particular key-value pair (see “Partitioning by Hash of Key”).
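按键哈希分区的原理可以这样示意,这里用 cksum 代替框架内部的哈希函数(分区数 3 纯属示例);关键在于相同的键总是落到同一个分区:

The principle of partitioning by key hash can be sketched as follows, with cksum standing in for the framework's internal hash function (the partition count of 3 is arbitrary); the point is that equal keys always land in the same partition:

```shell
# Route each key to one of three "reduce tasks" by hashing it; cksum stands
# in for MapReduce's partitioning hash, and the partition count of 3 is an
# arbitrary example. Equal keys always land in the same partition file.
rm -f /tmp/partition-*.txt
printf '/home\n/about\n/home\n/contact\n' | while read -r key; do
  h=$(printf '%s' "$key" | cksum | awk '{print $1 % 3}')
  echo "$key" >> "/tmp/partition-$h.txt"
done
wc -l /tmp/partition-*.txt
```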

键值对必须进行排序,但数据集可能太大,无法在单机上使用传统排序算法进行排序。相反,排序是分阶段进行的。首先,每个映射任务根据键的哈希值通过减速器对其输出进行分区。使用类似于我们在“SSTables 和 LSM-Trees”中讨论的技术,将每个分区写入映射器本地磁盘上的排序文件。

The key-value pairs must be sorted, but the dataset is likely too large to be sorted with a conventional sorting algorithm on a single machine. Instead, the sorting is performed in stages. First, each map task partitions its output by reducer, based on the hash of the key. Each of these partitions is written to a sorted file on the mapper’s local disk, using a technique similar to what we discussed in “SSTables and LSM-Trees”.

每当映射器完成读取其输入文件并写入其排序的输出文件时,MapReduce 调度程序就会通知化简器它们可以开始从该映射器获取输出文件。减速器连接到每个映射器并下载其分区的排序键值对文件。通过减速器进行分区、排序以及将数据分区从映射器复制到减速器的过程称为洗牌[ 26 ](一个令人困惑的术语——与洗牌不同,MapReduce 中不存在随机性)。

Whenever a mapper finishes reading its input file and writing its sorted output files, the MapReduce scheduler notifies the reducers that they can start fetching the output files from that mapper. The reducers connect to each of the mappers and download the files of sorted key-value pairs for their partition. The process of partitioning by reducer, sorting, and copying data partitions from mappers to reducers is known as the shuffle [26] (a confusing term—unlike shuffling a deck of cards, there is no randomness in MapReduce).

reduce 任务从映射器获取文件并将它们合并在一起,保留排序顺序。因此,如果不同的映射器生成具有相同键的记录,则它们在合并的缩减器输入中将是相邻的。

The reduce task takes the files from the mappers and merges them together, preserving the sort order. Thus, if different mappers produced records with the same key, they will be adjacent in the merged reducer input.
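This merge step can be sketched with Python's standard-library heap merge (a toy in-memory model; a real reduce task streams from the partition files it fetched over the network). Each "run" below stands in for one mapper's sorted output file.

```python
import heapq

# Sketch: a reduce task merges the pre-sorted partition files it fetched
# from each mapper. Because every run is already sorted by key, a k-way
# merge preserves the order, and equal keys become adjacent.
def merge_mapper_outputs(*sorted_runs):
    # Each run is an iterable of (key, value) pairs sorted by key.
    return list(heapq.merge(*sorted_runs, key=lambda kv: kv[0]))

merged = merge_mapper_outputs(
    [("apple", 1), ("cherry", 1)],   # output of mapper 1 for this partition
    [("apple", 1), ("banana", 1)],   # output of mapper 2 for this partition
)
# → [("apple", 1), ("apple", 1), ("banana", 1), ("cherry", 1)]
```

Note that both records for `"apple"` are adjacent in the merged input, which is exactly what the reducer relies on.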

使用键和迭代器调用减速器,该迭代器增量扫描具有相同键的所有记录(在某些情况下可能并不全部适合内存)。减速器可以使用任意逻辑来处理这些记录,并且可以生成任意数量的输出记录。这些输出记录被写入分布式文件系统上的文件中(通常,运行减速器的机器的本地磁盘上有一份副本,其他机器上有副本)。

The reducer is called with a key and an iterator that incrementally scans over all records with the same key (which may in some cases not all fit in memory). The reducer can use arbitrary logic to process these records, and can generate any number of output records. These output records are written to a file on the distributed filesystem (usually, one copy on the local disk of the machine running the reducer, with replicas on other machines).

MapReduce 工作流程

MapReduce workflows

使用单个 MapReduce 作业可以解决的问题范围是有限的。回顾一下日志分析示例,单个 MapReduce 作业可以确定每个 URL 的页面浏览量,但不能确定最流行的 URL,因为这需要第二轮排序。

The range of problems you can solve with a single MapReduce job is limited. Referring back to the log analysis example, a single MapReduce job could determine the number of page views per URL, but not the most popular URLs, since that requires a second round of sorting.
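The two rounds can be sketched as two tiny in-memory "jobs" (a toy model; in real MapReduce the first job's mappers would emit `(url, 1)` pairs and its reducers would sum them, with the second job reading the first job's output files):

```python
from collections import Counter

# "Job 1": count page views per URL. A Counter collapses the map and
# reduce steps of the counting job for illustration.
def count_views(log_lines):
    # Assumes each log line starts with the requested URL.
    return Counter(line.split()[0] for line in log_lines)

# "Job 2": a second round that ranks URLs by view count -- the extra
# sorting pass that a single counting job cannot provide.
def top_urls(counts, k=1):
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]

views = count_views(["/home alice", "/home bob", "/cart alice"])
top_urls(views)  # → [("/home", 2)]
```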

因此,将 MapReduce 作业链接在一起形成工作流是很常见的,这样一个作业的输出就成为下一个作业的输入。Hadoop MapReduce 框架对工作流没有任何特殊支持,因此这种链接是通过目录名称隐式完成的:第一个作业必须配置为将其输出写入 HDFS 中的指定目录,第二个作业必须配置为读取同一个目录名称作为其输入。从 MapReduce 框架的角度来看,它们是两个独立的作业。

Thus, it is very common for MapReduce jobs to be chained together into workflows, such that the output of one job becomes the input to the next job. The Hadoop MapReduce framework does not have any particular support for workflows, so this chaining is done implicitly by directory name: the first job must be configured to write its output to a designated directory in HDFS, and the second job must be configured to read that same directory name as its input. From the MapReduce framework’s point of view, they are two independent jobs.

因此,链式 MapReduce 作业不太像 Unix 命令的管道(直接将一个进程的输出作为输入传递给另一个进程,仅使用一个小的内存缓冲区),而更像是一系列命令,其中每个命令的输出都写入一个临时文件,下一个命令从临时文件中读取。这种设计有优点也有缺点,我们将在“中间状态的物化”中讨论。

Chained MapReduce jobs are therefore less like pipelines of Unix commands (which pass the output of one process as input to another process directly, using only a small in-memory buffer) and more like a sequence of commands where each command’s output is written to a temporary file, and the next command reads from the temporary file. This design has advantages and disadvantages, which we will discuss in “Materialization of Intermediate State”.

仅当作业成功完成时,批处理作业的输出才被视为有效(MapReduce 会丢弃失败作业的部分输出)。因此,工作流中的一项作业只能在先前作业(即生成其输入目录的作业)成功完成后才能启动。为了处理作业执行之间的这些依赖关系,已经开发了各种 Hadoop 工作流调度程序,包括 Oozie、Azkaban、Luigi、Airflow 和 Pinball [ 28 ]。

A batch job’s output is only considered valid when the job has completed successfully (MapReduce discards the partial output of a failed job). Therefore, one job in a workflow can only start when the prior jobs—that is, the jobs that produce its input directories—have completed successfully. To handle these dependencies between job executions, various workflow schedulers for Hadoop have been developed, including Oozie, Azkaban, Luigi, Airflow, and Pinball [28].

这些调度程序还具有管理功能,在维护大量批处理作业时非常有用。在构建推荐系统时,由 50 到 100 个 MapReduce 作业组成的工作流很常见 [ 29 ],并且在大型组织中,许多不同的团队可能正在运行不同的作业来读取彼此的输出。工具支持对于管理如此复杂的数据流非常重要。

These schedulers also have management features that are useful when maintaining a large collection of batch jobs. Workflows consisting of 50 to 100 MapReduce jobs are common when building recommendation systems [29], and in a large organization, many different teams may be running different jobs that read each other’s output. Tool support is important for managing such complex dataflows.

用于 Hadoop 的各种高级工具,例如 Pig [ 30 ]、Hive [ 31 ]、Cascading [ 32 ]、Crunch [ 33 ] 和 FlumeJava [ 34 ],也设置了多个 MapReduce 阶段的工作流程,这些阶段自动适当地连接在一起。

Various higher-level tools for Hadoop, such as Pig [30], Hive [31], Cascading [32], Crunch [33], and FlumeJava [34], also set up workflows of multiple MapReduce stages that are automatically wired together appropriately.

减少端连接和分组

Reduce-Side Joins and Grouping

我们在第 2 章中在数据模型和查询语言的背景下讨论了联接,但我们还没有深入研究联接的实际实现方式。现在是我们再次拿起这个话题的时候了。

We discussed joins in Chapter 2 in the context of data models and query languages, but we have not delved into how joins are actually implemented. It is time that we pick up that thread again.

在许多数据集中,一条记录与另一条记录存在关联是很常见的:关系模型中的外键、文档模型中的文档引用或图形模型中的边。每当您有一些代码需要访问该关联两侧的记录(保存引用的记录和被引用的记录)时,就需要连接。正如第 2 章中所讨论的,非规范化可以减少连接的需要,但通常不会完全消除它。

In many datasets it is common for one record to have an association with another record: a foreign key in a relational model, a document reference in a document model, or an edge in a graph model. A join is necessary whenever you have some code that needs to access records on both sides of that association (both the record that holds the reference and the record being referenced). As discussed in Chapter 2, denormalization can reduce the need for joins but generally not remove it entirely.

在数据库中,如果执行只涉及少量记录的查询,数据库通常会使用索引快速定位感兴趣的记录(请参阅第 3 章)。如果查询涉及连接,则可能需要多次索引查找。然而,MapReduce 没有索引的概念——至少没有通常意义上的索引。

In a database, if you execute a query that involves only a small number of records, the database will typically use an index to quickly locate the records of interest (see Chapter 3). If the query involves joins, it may require multiple index lookups. However, MapReduce has no concept of indexes—at least not in the usual sense.

当 MapReduce 作业收到一组文件作为输入时,它会读取所有这些文件的全部内容;数据库将此操作称为全表扫描。如果您只想读取少量记录,那么与索引查找相比,全表扫描的成本非常高。然而,在分析查询中(参见“事务处理还是分析?”),通常需要计算大量记录的聚合。在这种情况下,扫描整个输入可能是相当合理的事情,特别是如果您可以跨多台机器并行处理。

When a MapReduce job is given a set of files as input, it reads the entire content of all of those files; a database would call this operation a full table scan. If you only want to read a small number of records, a full table scan is outrageously expensive compared to an index lookup. However, in analytic queries (see “Transaction Processing or Analytics?”) it is common to want to calculate aggregates over a large number of records. In this case, scanning the entire input might be quite a reasonable thing to do, especially if you can parallelize the processing across multiple machines.

当我们在批处理上下文中谈论连接时,我们的意思是解决数据集中出现的所有关联。例如,我们假设一项作业同时处理所有用户的数据,而不仅仅是查找某个特定用户的数据(使用索引可以更有效地完成此操作)。

When we talk about joins in the context of batch processing, we mean resolving all occurrences of some association within a dataset. For example, we assume that a job is processing the data for all users simultaneously, not merely looking up the data for one particular user (which would be done far more efficiently with an index).

示例:用户活动事件分析

Example: analysis of user activity events

批处理作业中连接的典型示例如图 10-2 所示。左侧是事件日志,描述登录用户在网站上执行的操作(称为活动事件或点击流数据),右侧是用户数据库。您可以将此示例视为星型模式的一部分(请参阅“星星和雪花:用于分析的模式”):事件日志是事实表,用户数据库是维度之一。

A typical example of a join in a batch job is illustrated in Figure 10-2. On the left is a log of events describing the things that logged-in users did on a website (known as activity events or clickstream data), and on the right is a database of users. You can think of this example as being part of a star schema (see “Stars and Snowflakes: Schemas for Analytics”): the log of events is the fact table, and the user database is one of the dimensions.

图 10-2。用户活动事件日志和用户配置文件数据库之间的连接。

Figure 10-2. A join between a log of user activity events and a database of user profiles.

分析任务可能需要将用户活动与用户个人资料信息相关联:例如,如果个人资料包含用户的年龄或出生日期,系统可以确定哪些页面最受哪些年龄段的欢迎。但是,活动事件仅包含用户 ID,而不包含完整的用户配置文件信息。将该个人资料信息嵌入到每个活动事件中很可能太浪费了。因此,活动事件需要与用户配置文件数据库连接。

An analytics task may need to correlate user activity with user profile information: for example, if the profile contains the user’s age or date of birth, the system could determine which pages are most popular with which age groups. However, the activity events contain only the user ID, not the full user profile information. Embedding that profile information in every single activity event would most likely be too wasteful. Therefore, the activity events need to be joined with the user profile database.

此连接的最简单实现将一一检查活动事件,并查询用户数据库(在远程服务器上)以获取它遇到的每个用户 ID。这是可能的,但它很可能会受到非常差的性能的影响:处理吞吐量将受到到数据库服务器的往返时间的限制,本地缓存的有效性将在很大程度上取决于数据的分布,并且运行大量并行查询很容易压垮数据库[ 35 ]。

The simplest implementation of this join would go over the activity events one by one and query the user database (on a remote server) for every user ID it encounters. This is possible, but it would most likely suffer from very poor performance: the processing throughput would be limited by the round-trip time to the database server, the effectiveness of a local cache would depend very much on the distribution of data, and running a large number of queries in parallel could easily overwhelm the database [35].

为了在批处理过程中实现良好的吞吐量,计算必须(尽可能)在一台机器本地进行。通过网络对要处理的每条记录发出随机访问请求的速度太慢。此外,查询远程数据库意味着批处理作业变得不确定,因为远程数据库中的数据可能会发生变化。

In order to achieve good throughput in a batch process, the computation must be (as much as possible) local to one machine. Making random-access requests over the network for every record you want to process is too slow. Moreover, querying a remote database would mean that the batch job becomes nondeterministic, because the data in the remote database might change.

因此,更好的方法是获取用户数据库的副本(例如,使用 ETL 过程从数据库备份中提取——请参阅“数据仓库”),并将其放入与用户活动事件日志相同的分布式文件系统中。然后,用户数据库位于 HDFS 的一组文件中,用户活动记录位于另一组文件中,您可以使用 MapReduce 将所有相关记录集中到同一位置并高效地处理它们。

Thus, a better approach would be to take a copy of the user database (for example, extracted from a database backup using an ETL process—see “Data Warehousing”) and to put it in the same distributed filesystem as the log of user activity events. You would then have the user database in one set of files in HDFS and the user activity records in another set of files, and could use MapReduce to bring together all of the relevant records in the same place and process them efficiently.

排序合并连接

Sort-merge joins

回想一下,映射器的目的是从每个输入记录中提取键和值。在图 10-2 的情况下,该键将是用户 ID:一组映射器将遍历活动事件(提取用户 ID 作为键,将活动事件作为值),而另一组映射器将遍历用户数据库(提取用户 ID 作为键,提取用户的出生日期作为值)。该过程如图 10-3 所示。

Recall that the purpose of the mapper is to extract a key and value from each input record. In the case of Figure 10-2, this key would be the user ID: one set of mappers would go over the activity events (extracting the user ID as the key and the activity event as the value), while another set of mappers would go over the user database (extracting the user ID as the key and the user’s date of birth as the value). This process is illustrated in Figure 10-3.

图 10-3。用户 ID 上的归约端排序合并联接。如果输入数据集被划分为多个文件,则每个文件都可以使用多个映射器并行处理。

Figure 10-3. A reduce-side sort-merge join on user ID. If the input datasets are partitioned into multiple files, each could be processed with multiple mappers in parallel.

当 MapReduce 框架按键对 mapper 输出进行分区,然后对键值对进行排序时,其效果是所有具有相同用户 ID 的活动事件和用户记录在 reducer 输入中彼此相邻。MapReduce 作业甚至可以安排记录的排序方式,使得 reducer 总是先看到来自用户数据库的记录,然后按时间戳顺序看到活动事件——这种技术称为辅助排序(secondary sort)[ 26 ]。

When the MapReduce framework partitions the mapper output by key and then sorts the key-value pairs, the effect is that all the activity events and the user record with the same user ID become adjacent to each other in the reducer input. The MapReduce job can even arrange the records to be sorted such that the reducer always sees the record from the user database first, followed by the activity events in timestamp order—this technique is known as a secondary sort [26].

然后,reducer 可以轻松执行实际的连接逻辑:为每个用户 ID 调用一次 reducer 函数,并且由于二次排序,第一个值预计是用户数据库中的出生日期记录。减速器将出生日期存储在本地变量中,然后迭代具有相同用户 ID 的活动事件,输出viewed-urlviewer-age-in-years对。随后的 MapReduce 作业可以计算每个 URL 的观看者年龄分布,并按年龄组进行聚类。

The reducer can then perform the actual join logic easily: the reducer function is called once for every user ID, and thanks to the secondary sort, the first value is expected to be the date-of-birth record from the user database. The reducer stores the date of birth in a local variable and then iterates over the activity events with the same user ID, outputting pairs of viewed-url and viewer-age-in-years. Subsequent MapReduce jobs could then calculate the distribution of viewer ages for each URL, and cluster by age group.
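A sketch of such a reducer follows. The record tags (`"dob"`, `"view"`), the year constant, and the age arithmetic are illustrative assumptions; the key point is that the secondary sort guarantees the date-of-birth record arrives first, so one local variable suffices.

```python
# Sketch of the reduce-side join logic for one user ID. `records` is the
# reducer's sorted input for that key: the user-database record first
# (thanks to the secondary sort), then activity events in timestamp order.
def join_reducer(user_id, records, current_year=2017):
    it = iter(records)
    tag, birth_year = next(it)
    assert tag == "dob"                  # secondary sort puts this first
    age = current_year - birth_year      # simplified age calculation
    for tag, url in it:                  # remaining records are events
        yield (url, age)                 # (viewed-url, viewer-age) pairs

list(join_reducer("u123", [("dob", 1990), ("view", "/home"), ("view", "/cart")]))
# → [("/home", 27), ("/cart", 27)]
```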

由于reducer一次性处理特定用户ID的所有记录,因此它每次只需要在内存中保留一条用户记录,并且不需要通过网络发出任何请求。该算法称为排序合并连接,因为映射器输出按键排序,然后缩减器将来自连接两侧的记录的排序列表合并在一起。

Since the reducer processes all of the records for a particular user ID in one go, it only needs to keep one user record in memory at any one time, and it never needs to make any requests over the network. This algorithm is known as a sort-merge join, since mapper output is sorted by key, and the reducers then merge together the sorted lists of records from both sides of the join.

将相关数据集中在同一位置

Bringing related data together in the same place

在排序合并连接中,映射器和排序过程确保对特定用户 ID 执行连接操作所需的所有数据都集中在同一个位置:对减速器的一次调用。预先排列好所有需要的数据后,reducer 可以是相当简单的单线程代码,可以以高吞吐量和低内存开销来搅动记录。

In a sort-merge join, the mappers and the sorting process make sure that all the necessary data to perform the join operation for a particular user ID is brought together in the same place: a single call to the reducer. Having lined up all the required data in advance, the reducer can be a fairly simple, single-threaded piece of code that can churn through records with high throughput and low memory overhead.

看待这种架构的一种方式是:映射器向减速器“发送消息”。当映射器发出一个键值对时,键就像该值应被传递到的目标地址。尽管键只是一个任意字符串(而不是像 IP 地址和端口号那样的实际网络地址),但它的行为就像一个地址:具有相同键的所有键值对都将被传递到同一个目的地(对减速器的一次调用)。

One way of looking at this architecture is that mappers “send messages” to the reducers. When a mapper emits a key-value pair, the key acts like the destination address to which the value should be delivered. Even though the key is just an arbitrary string (not an actual network address like an IP address and port number), it behaves like an address: all key-value pairs with the same key will be delivered to the same destination (a call to the reducer).

使用 MapReduce 编程模型将计算的物理网络通信方面(将数据获取到正确的机器)与应用程序逻辑(获得数据后进行处理)分开。这种分离与数据库的典型使用形成鲜明对比,在数据库中,从数据库获取数据的请求通常发生在一段应用程序代码的深处[ 36 ]。由于 MapReduce 处理所有网络通信,因此它还使应用程序代码不必担心部分故障,例如另一个节点的崩溃:MapReduce 透明地重试失败的任务,而不会影响应用程序逻辑。

Using the MapReduce programming model has separated the physical network communication aspects of the computation (getting the data to the right machine) from the application logic (processing the data once you have it). This separation contrasts with the typical use of databases, where a request to fetch data from a database often occurs somewhere deep inside a piece of application code [36]. Since MapReduce handles all network communication, it also shields the application code from having to worry about partial failures, such as the crash of another node: MapReduce transparently retries failed tasks without affecting the application logic.

通过...分组

GROUP BY

除了连接之外,“将相关数据带到同一位置”模式的另一个常见用途是按某个键对记录进行分组(如 SQL 中的 GROUP BY 子句)。具有相同键的所有记录形成一个组,下一步通常是在每个组内执行某种聚合,例如:

Besides joins, another common use of the “bringing related data to the same place” pattern is grouping records by some key (as in the GROUP BY clause in SQL). All records with the same key form a group, and the next step is often to perform some kind of aggregation within each group—for example:

  • 计算每个组中的记录数(就像我们计算页面浏览量的示例一样,在 SQL 中您可以将其表示为 COUNT(*) 聚合)

  • Counting the number of records in each group (like in our example of counting page views, which you would express as a COUNT(*) aggregation in SQL)

  • 将某个特定字段中的值相加(SQL 中的 SUM(fieldname))

  • Adding up the values in one particular field (SUM(fieldname)) in SQL

  • 根据某种排名函数选取前k条记录

  • Picking the top k records according to some ranking function

使用 MapReduce 实现此类分组操作的最简单方法是设置映射器,以便它们生成的键值对使用所需的分组键。然后,分区和排序过程将具有相同键的所有记录汇集到同一个减速器中。因此,在 MapReduce 之上实现时,分组和连接看起来非常相似。

The simplest way of implementing such a grouping operation with MapReduce is to set up the mappers so that the key-value pairs they produce use the desired grouping key. The partitioning and sorting process then brings together all the records with the same key in the same reducer. Thus, grouping and joining look quite similar when implemented on top of MapReduce.
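A toy model of the reduce side of such a grouping, assuming the framework has already sorted the mapper output by key (which is exactly what the shuffle provides):

```python
from itertools import groupby

# Sketch: once mapper output is sorted by the grouping key, reducing is
# just a single scan over runs of adjacent equal keys.
def reduce_by_key(sorted_pairs, reduce_fn):
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, reduce_fn(v for _, v in group)

pairs = sorted([("a", 1), ("b", 1), ("a", 1)])   # shuffle's sort, simulated
dict(reduce_by_key(pairs, sum))  # → {"a": 2, "b": 1}
```

Note that `groupby` only groups *adjacent* equal keys, which is why the sort step is essential.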

分组的另一个常见用途是整理特定用户会话的所有活动事件,以便找出用户采取的操作的顺序——这个过程称为会话化 [ 37 ]。例如,此类分析可用于确定显示新版本网站的用户是否比显示旧版本的用户更有可能进行购买(A/B 测试),或者计算某些营销是否活动是值得的。

Another common use for grouping is collating all the activity events for a particular user session, in order to find out the sequence of actions that the user took—a process called sessionization [37]. For example, such analysis could be used to work out whether users who were shown a new version of your website are more likely to make a purchase than those who were shown the old version (A/B testing), or to calculate whether some marketing activity is worthwhile.

如果您有多个处理用户请求的 Web 服务器,则特定用户的活动事件很可能分散在各个不同服务器的日志文件中。您可以通过使用会话 cookie、用户 ID 或类似标识符作为分组键,并将特定用户的所有活动事件集中到一个位置,同时将不同用户的事件分布到不同的分区来实现会话化。

If you have multiple web servers handling user requests, the activity events for a particular user are most likely scattered across various different servers’ log files. You can implement sessionization by using a session cookie, user ID, or similar identifier as the grouping key and bringing all the activity events for a particular user together in one place, while distributing different users’ events across different partitions.

处理倾斜

Handling skew

如果与单个键相关的数据量非常大,那么“将具有相同键的所有记录带到同一位置”的模式就会失效。例如,在社交网络中,大多数用户可能与几百人有联系,但少数名人可能拥有数百万粉丝。这种不成比例的活跃数据库记录被称为关键对象 [ 38 ]或热键

The pattern of “bringing all records with the same key to the same place” breaks down if there is a very large amount of data related to a single key. For example, in a social network, most users might be connected to a few hundred people, but a small number of celebrities may have many millions of followers. Such disproportionately active database records are known as linchpin objects [38] or hot keys.

在单个减速器中收集与名人相关的所有活动(例如,对他们发布的内容的回复)可能会导致严重的偏差(也称为热点),即一个减速器必须比其他减速器处理更多的记录(请参阅 “倾斜的工作负载和缓解热点”)。由于 MapReduce 作业仅在其所有映射器和化简器完成后才完成,因此任何后续作业必须等待最慢的化简器完成才能开始。

Collecting all activity related to a celebrity (e.g., replies to something they posted) in a single reducer can lead to significant skew (also known as hot spots)—that is, one reducer that must process significantly more records than the others (see “Skewed Workloads and Relieving Hot Spots”). Since a MapReduce job is only complete when all of its mappers and reducers have completed, any subsequent jobs must wait for the slowest reducer to complete before they can start.

如果连接输入有热键,您可以使用一些算法来补偿。例如, Pig 中的倾斜连接方法首先运行采样作业来确定哪些键是热门键 [ 39 ]。在执行实际连接时,映射器将与热键相关的任何记录发送到随机选择的几个减速器之一(与传统的 MapReduce 不同,后者根据键的哈希确定性地选择减速器)。对于连接的其他输入,与热键相关的记录需要复制到处理该键的所有减速器[ 40 ]。

If a join input has hot keys, there are a few algorithms you can use to compensate. For example, the skewed join method in Pig first runs a sampling job to determine which keys are hot [39]. When performing the actual join, the mappers send any records relating to a hot key to one of several reducers, chosen at random (in contrast to conventional MapReduce, which chooses a reducer deterministically based on a hash of the key). For the other input to the join, records relating to the hot key need to be replicated to all reducers handling that key [40].

这种技术将处理热键的工作分散到多个化简器上,这使得它可以更好地并行化,但代价是必须将其他连接输入复制到多个化简器。Crunch 中的分片连接方法类似,但需要显式指定热键而不是使用采样作业。这项技术也与我们在 “倾斜工作负载和缓解热点”中讨论的技术非常相似,即使用随机化来缓解分区数据库中的热点。

This technique spreads the work of handling the hot key over several reducers, which allows it to be parallelized better, at the cost of having to replicate the other join input to multiple reducers. The sharded join method in Crunch is similar, but requires the hot keys to be specified explicitly rather than using a sampling job. This technique is also very similar to one we discussed in “Skewed Workloads and Relieving Hot Spots”, using randomization to alleviate hot spots in a partitioned database.

Hive 的倾斜连接优化采用了另一种方法。它要求在表元数据中显式指定热键,并将与这些键相关的记录存储在与其余记录分开的文件中。在对该表执行连接时,它对热键使用 map 端连接(请参阅下一节)。

Hive’s skewed join optimization takes an alternative approach. It requires hot keys to be specified explicitly in the table metadata, and it stores records related to those keys in separate files from the rest. When performing a join on that table, it uses a map-side join (see the next section) for the hot keys.

当通过热键对记录进行分组并聚合它们时,您可以分两个阶段执行分组。第一个 MapReduce 阶段将记录发送到随机缩减器,以便每个缩减器对热键的记录子集执行分组,并为每个键输出更紧凑的聚合值。然后,第二个 MapReduce 作业将来自所有第一阶段缩减器的值组合成每个键的单个值。

When grouping records by a hot key and aggregating them, you can perform the grouping in two stages. The first MapReduce stage sends records to a random reducer, so that each reducer performs the grouping on a subset of records for the hot key and outputs a more compact aggregated value per key. The second MapReduce job then combines the values from all of the first-stage reducers into a single value per key.
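The two stages can be sketched as follows. The key-salting scheme (`key#<n>`) and the fan-out value are illustrative assumptions, not a specific framework's API:

```python
import random
from collections import Counter

# Stage 1: append a random suffix so the hot key's records are spread
# across several reducers, each of which produces a partial count.
def salted(key, fanout=4):
    return f"{key}#{random.randrange(fanout)}"

def two_stage_count(keys, fanout=4):
    partial = Counter(salted(k, fanout) for k in keys)   # stage-1 reducers
    # Stage 2: strip the suffix and combine the partial aggregates into a
    # single value per original key.
    final = Counter()
    for salted_key, count in partial.items():
        final[salted_key.split("#")[0]] += count
    return final

counts = two_stage_count(["celebrity"] * 1000 + ["ordinary"])
# counts["celebrity"] == 1000, computed by up to 4 stage-1 reducers
```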

地图端连接

Map-Side Joins

上一节中描述的连接算法在reducers中执行实际的连接逻辑,因此被称为reduce端连接。映射器负责准备输入数据:从每个输入记录中提取键和值,将键值对分配给reducer分区,并按键排序。

The join algorithms described in the last section perform the actual join logic in the reducers, and are hence known as reduce-side joins. The mappers take the role of preparing the input data: extracting the key and value from each input record, assigning the key-value pairs to a reducer partition, and sorting by key.

减少端方法的优点是您不需要对输入数据做出任何假设:无论其属性和结构如何,映射器都可以准备数据以准备加入。然而,缺点是所有排序、复制到减速器以及合并减速器输入的成本可能相当昂贵。根据可用的内存缓冲区,数据在经过 MapReduce 阶段时可能会多次写入磁盘 [ 37 ]。

The reduce-side approach has the advantage that you do not need to make any assumptions about the input data: whatever its properties and structure, the mappers can prepare the data to be ready for joining. However, the downside is that all that sorting, copying to reducers, and merging of reducer inputs can be quite expensive. Depending on the available memory buffers, data may be written to disk several times as it passes through the stages of MapReduce [37].

另一方面,如果您可以对输入数据做出某些假设,则可以通过使用所谓的映射端连接来加快连接速度。这种方法使用精简的 MapReduce 作业,其中没有缩减器,也没有排序。相反,每个映射器只是从分布式文件系统读取一个输入文件块,并将一个输出文件写入文件系统 - 仅此而已。

On the other hand, if you can make certain assumptions about your input data, it is possible to make joins faster by using a so-called map-side join. This approach uses a cut-down MapReduce job in which there are no reducers and no sorting. Instead, each mapper simply reads one input file block from the distributed filesystem and writes one output file to the filesystem—that is all.

广播哈希连接

Broadcast hash joins

执行地图端连接的最简单方法适用于将大数据集与小数据集连接的情况。特别是,小数据集需要足够小,以便可以将其完全加载到每个映射器的内存中。

The simplest way of performing a map-side join applies in the case where a large dataset is joined with a small dataset. In particular, the small dataset needs to be small enough that it can be loaded entirely into memory in each of the mappers.

例如,假设在图 10-2 的情况下,用户数据库足够小,可以容纳在内存中。在这种情况下,当映射器启动时,它可以首先将用户数据库从分布式文件系统读取到内存中的哈希表中。完成此操作后,映射器可以扫描用户活动事件并简单地在哈希表中查找每个事件的用户 ID。

For example, imagine in the case of Figure 10-2 that the user database is small enough to fit in memory. In this case, when a mapper starts up, it can first read the user database from the distributed filesystem into an in-memory hash table. Once this is done, the mapper can scan over the user activity events and simply look up the user ID for each event in the hash table.
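In outline (a sketch with made-up record shapes, not a framework API), a broadcast hash join mapper looks like this:

```python
# Sketch of a broadcast hash join: the small input (the user database) is
# loaded into an in-memory hash table; the mapper then streams over its
# block of the large input (activity events) and probes the table.
def broadcast_hash_join(user_db_records, activity_events):
    # Build phase: small input -> hash table (one per mapper).
    users = dict(user_db_records)
    # Probe phase: one lookup per activity event, no network round trips.
    for user_id, event in activity_events:
        if user_id in users:
            yield (event, users[user_id])

joined = list(broadcast_hash_join(
    [("u1", 1990), ("u2", 1985)],
    [("u1", "/home"), ("u3", "/cart"), ("u2", "/home")],
))
# → [("/home", 1990), ("/home", 1985)]
```

Events for `"u3"` are dropped because that user is absent from the small input, i.e., this sketch implements an inner join.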

仍然可以有多个映射任务:一个用于连接的大输入的每个文件块(在图 10-2的示例中,活动事件是大输入)。每个映射器都将小输入完全加载到内存中。

There can still be several map tasks: one for each file block of the large input to the join (in the example of Figure 10-2, the activity events are the large input). Each of these mappers loads the small input entirely into memory.

这种简单而有效的算法称为广播哈希连接(broadcast hash join):广播一词反映了这样一个事实——大输入每个分区的映射器都会读取整个小输入(因此小输入实际上被“广播”到大输入的所有分区);哈希一词反映了它对哈希表的使用。Pig(名为“replicated join”)、Hive(“MapJoin”)、Cascading 和 Crunch 都支持这种连接方法。它也用于 Impala 等数据仓库查询引擎 [ 41 ]。

This simple but effective algorithm is called a broadcast hash join: the word broadcast reflects the fact that each mapper for a partition of the large input reads the entirety of the small input (so the small input is effectively “broadcast” to all partitions of the large input), and the word hash reflects its use of a hash table. This join method is supported by Pig (under the name “replicated join”), Hive (“MapJoin”), Cascading, and Crunch. It is also used in data warehouse query engines such as Impala [41].

另一种替代方法是将小连接输入存储在本地磁盘上的只读索引中,而不是将小连接输入加载到内存中的哈希表中[42 ]。该索引的常用部分将保留在操作系统的页面缓存中,因此这种方法可以提供几乎与内存中哈希表一样快的随机访问查找,但实际上不需要数据集适合内存。

Instead of loading the small join input into an in-memory hash table, an alternative is to store the small join input in a read-only index on the local disk [42]. The frequently used parts of this index will remain in the operating system’s page cache, so this approach can provide random-access lookups almost as fast as an in-memory hash table, but without actually requiring the dataset to fit in memory.

分区哈希连接

Partitioned hash joins

如果映射端连接的输入以相同的方式分区,则哈希连接方法可以独立地应用于每个分区。在图 10-2的情况下,您可以安排活动事件和用户数据库根据用户 ID 的最后一位十进制数字进行分区(因此两侧各有 10 个分区)。例如,mapper 3首先将所有ID以3结尾的用户加载到哈希表中,然后扫描ID以3结尾的每个用户的所有活动事件。

If the inputs to the map-side join are partitioned in the same way, then the hash join approach can be applied to each partition independently. In the case of Figure 10-2, you might arrange for the activity events and the user database to each be partitioned based on the last decimal digit of the user ID (so there are 10 partitions on either side). For example, mapper 3 first loads all users with an ID ending in 3 into a hash table, and then scans over all the activity events for each user whose ID ends in 3.
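A sketch of this scheme, using the last-decimal-digit partitioning from the example (the loop over partitions simulates the independent mappers; in a real job each mapper would read only its own partition's files):

```python
# Sketch of a partitioned hash join: both inputs are partitioned by the
# same function of the join key, so mapper i only has to load partition i
# of the small input into its hash table.
def partition_of(user_id: int) -> int:
    return user_id % 10          # last decimal digit of the user ID

def partitioned_join(users, events, num_partitions=10):
    for p in range(num_partitions):              # one "mapper" per partition
        table = {uid: prof for uid, prof in users if partition_of(uid) == p}
        for uid, ev in events:
            if partition_of(uid) == p and uid in table:
                yield (ev, table[uid])

joined = list(partitioned_join(
    [(13, "alice"), (27, "bob")],
    [(13, "/home"), (27, "/cart"), (99, "/x")],
))
# → [("/home", "alice"), ("/cart", "bob")]
```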

如果分区正确完成,您可以确保您可能想要连接的所有记录都位于同一编号的分区中,因此每个映射器仅从每个输入数据集中读取一个分区就足够了。这样做的优点是每个映射器可以将更少量的数据加载到其哈希表中。

If the partitioning is done correctly, you can be sure that all the records you might want to join are located in the same numbered partition, and so it is sufficient for each mapper to only read one partition from each of the input datasets. This has the advantage that each mapper can load a smaller amount of data into its hash table.

仅当连接的两个输入具有相同数量的分区,并且记录基于相同的键和相同的哈希函数分配给分区时,此方法才有效。如果输入是由已经执行此分组的先前 MapReduce 作业生成的,那么这可能是一个合理的假设。

This approach only works if both of the join’s inputs have the same number of partitions, with records assigned to partitions based on the same key and the same hash function. If the inputs are generated by prior MapReduce jobs that already perform this grouping, then this can be a reasonable assumption to make.

分区哈希连接在 Hive 中称为分桶映射连接[ 37 ]。

Partitioned hash joins are known as bucketed map joins in Hive [37].

映射端合并连接

Map-side merge joins

如果输入数据集不仅以相同的方式分区,而且还基于相同的键排序,则可以应用 map 端连接的另一种变体。在这种情况下,输入是否足够小以适合内存并不重要,因为映射器可以执行通常由减速器完成的同样的合并操作:按键升序增量读取两个输入文件,并匹配具有相同键的记录。

Another variant of a map-side join applies if the input datasets are not only partitioned in the same way, but also sorted based on the same key. In this case, it does not matter whether the inputs are small enough to fit in memory, because a mapper can perform the same merging operation that would normally be done by a reducer: reading both input files incrementally, in order of ascending key, and matching records with the same key.
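The merging can be sketched as follows, assuming one user record per ID and both inputs sorted by user ID (a simplified model of what such a mapper does):

```python
# Sketch of a map-side merge join: both inputs are read incrementally in
# ascending key order, like the merge step of a sort-merge join, so
# neither input needs to fit in memory.
def merge_join(sorted_users, sorted_events):
    users = iter(sorted_users)
    uid, profile = next(users, (None, None))
    for event_uid, event in sorted_events:
        # Advance the user cursor until it catches up with the event key.
        while uid is not None and uid < event_uid:
            uid, profile = next(users, (None, None))
        if uid == event_uid:
            yield (event, profile)

joined = list(merge_join(
    [(1, "alice"), (3, "carol")],
    [(1, "/home"), (2, "/cart"), (3, "/x"), (3, "/y")],
))
# → [("/home", "alice"), ("/x", "carol"), ("/y", "carol")]
```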

如果地图端合并联接是可能的,则可能意味着先前的 MapReduce 作业首先将输入数据集带入此分区和排序的形式。原则上,这个连接可以在前一个作业的reduce阶段执行。但是,在单独的仅地图作业中执行合并联接可能仍然是合适的,例如,如果除了此特定联接之外还需要分区和排序的数据集用于其他目的。

If a map-side merge join is possible, it probably means that prior MapReduce jobs brought the input datasets into this partitioned and sorted form in the first place. In principle, this join could have been performed in the reduce stage of the prior job. However, it may still be appropriate to perform the merge join in a separate map-only job, for example if the partitioned and sorted datasets are also needed for other purposes besides this particular join.

具有映射端连接的 MapReduce 工作流程

MapReduce workflows with map-side joins

当下游作业使用 MapReduce 连接的输出时,选择 map 端连接还是 reduce 端连接会影响输出的结构。reduce 端连接的输出按连接键进行分区和排序,而 map 端连接的输出则以与大输入相同的方式进行分区和排序(因为无论使用分区连接还是广播连接,都会为连接的大输入的每个文件块启动一个 map 任务)。

When the output of a MapReduce join is consumed by downstream jobs, the choice of map-side or reduce-side join affects the structure of the output. The output of a reduce-side join is partitioned and sorted by the join key, whereas the output of a map-side join is partitioned and sorted in the same way as the large input (since one map task is started for each file block of the join’s large input, regardless of whether a partitioned or broadcast join is used).

正如所讨论的,映射端连接还对其输入数据集的大小、排序和分区做出更多假设。在优化连接策略时,了解分布式文件系统中数据集的物理布局变得很重要:仅了解编码格式和存储数据的目录名称是不够的;您还必须知道分区的数量以及数据分区和排序所依据的键。

As discussed, map-side joins also make more assumptions about the size, sorting, and partitioning of their input datasets. Knowing about the physical layout of datasets in the distributed filesystem becomes important when optimizing join strategies: it is not sufficient to just know the encoding format and the name of the directory in which the data is stored; you must also know the number of partitions and the keys by which the data is partitioned and sorted.

在 Hadoop 生态系统中,这种关于数据集分区的元数据通常保存在 HCatalog 和 Hive 元存储中 [ 37 ]。

In the Hadoop ecosystem, this kind of metadata about the partitioning of datasets is often maintained in HCatalog and the Hive metastore [37].

批处理工作流程的输出

The Output of Batch Workflows

我们已经讨论了很多用于实现 MapReduce 作业工作流的各种算法,但是我们忽略了一个重要的问题:所有处理完成后的结果是什么?我们为什么要运行所有这些工作?

We have talked a lot about the various algorithms for implementing workflows of MapReduce jobs, but we neglected an important question: what is the result of all of that processing, once it is done? Why are we running all these jobs in the first place?

就数据库查询而言,我们将事务处理 (OLTP) 目的与分析目的区分开来(请参阅“事务处理还是分析?”)。我们看到,OLTP 查询通常使用索引按键查找少量记录,以便将它们呈现给用户(例如,在网页上)。另一方面,分析查询通常会扫描大量记录,执行分组和聚合,并且输出通常采用报告的形式:显示指标随时间变化的图表,或者根据数据显示前 10 项某些排名,或将某些数量细分为子类别。此类报告的使用者通常是需要做出业务决策的分析师或经理。

In the case of database queries, we distinguished transaction processing (OLTP) purposes from analytic purposes (see “Transaction Processing or Analytics?”). We saw that OLTP queries generally look up a small number of records by key, using indexes, in order to present them to a user (for example, on a web page). On the other hand, analytic queries often scan over a large number of records, performing groupings and aggregations, and the output often has the form of a report: a graph showing the change in a metric over time, or the top 10 items according to some ranking, or a breakdown of some quantity into subcategories. The consumer of such a report is often an analyst or a manager who needs to make business decisions.

批处理适用于哪里?它不是事务处理,也不是分析。它更接近分析,因为批处理通常会扫描输入数据集的大部分。然而,MapReduce 作业的工作流程与用于分析目的的 SQL 查询不同(请参阅“Hadoop 与分布式数据库的比较”)。批处理过程的输出通常不是报告,而是某种其他类型的结构。

Where does batch processing fit in? It is not transaction processing, nor is it analytics. It is closer to analytics, in that a batch process typically scans over large portions of an input dataset. However, a workflow of MapReduce jobs is not the same as a SQL query used for analytic purposes (see “Comparing Hadoop to Distributed Databases”). The output of a batch process is often not a report, but some other kind of structure.

建立搜索索引

Building search indexes

Google 最初使用 MapReduce 是为其搜索引擎构建索引,该索引是作为 5 到 10 个 MapReduce 作业的工作流程实现的 [ 1 ]。尽管 Google 后来不再为此目的使用 MapReduce [ 43 ],但如果您通过构建搜索索引的角度来看待 MapReduce,它会有助于理解 MapReduce。(即使在今天,Hadoop MapReduce 仍然是为 Lucene/Solr 构建索引的好方法 [ 44 ]。)

Google’s original use of MapReduce was to build indexes for its search engine, which was implemented as a workflow of 5 to 10 MapReduce jobs [1]. Although Google later moved away from using MapReduce for this purpose [43], it helps to understand MapReduce if you look at it through the lens of building a search index. (Even today, Hadoop MapReduce remains a good way of building indexes for Lucene/Solr [44].)

我们在“全文搜索和模糊索引”中简要了解了诸如 Lucene 之类的全文搜索索引是如何工作的:它是一个文件(词项字典),您可以在其中高效地查找特定关键字,并找到包含该关键字的所有文档 ID 的列表(倒排列表)。这是对搜索索引的一个非常简化的视图——实际上它还需要各种附加数据,以便按相关性对搜索结果进行排名、纠正拼写错误、解析同义词等——但其原则是成立的。

We saw briefly in “Full-text search and fuzzy indexes” how a full-text search index such as Lucene works: it is a file (the term dictionary) in which you can efficiently look up a particular keyword and find the list of all the document IDs containing that keyword (the postings list). This is a very simplified view of a search index—in reality it requires various additional data, in order to rank search results by relevance, correct misspellings, resolve synonyms, and so on—but the principle holds.

如果您需要对一组固定的文档执行全文搜索,那么批处理是构建索引的一种非常有效的方法:映射器根据需要对文档集进行分区,每个化简器为其分区构建索引,并将索引文件写入分布式文件系统。构建此类文档分区索引(请参阅“分区和二级索引”)可以很好地并行化。由于按关键字查询搜索索引是只读操作,因此这些索引文件一旦创建就不可变。

If you need to perform a full-text search over a fixed set of documents, then a batch process is a very effective way of building the indexes: the mappers partition the set of documents as needed, each reducer builds the index for its partition, and the index files are written to the distributed filesystem. Building such document-partitioned indexes (see “Partitioning and Secondary Indexes”) parallelizes very well. Since querying a search index by keyword is a read-only operation, these index files are immutable once they have been created.

如果索引的文档集发生更改,一种选择是定期重新运行整个文档集的整个索引工作流程,并在完成后用新的索引文件批量替换以前的索引文件。如果只有少量文档发生变化,这种方法的计算成本可能会很高,但它的优点是索引过程很容易推理:文档输入,索引输出。

If the indexed set of documents changes, one option is to periodically rerun the entire indexing workflow for the entire set of documents, and replace the previous index files wholesale with the new index files when it is done. This approach can be computationally expensive if only a small number of documents have changed, but it has the advantage that the indexing process is very easy to reason about: documents in, indexes out.

或者,可以增量地构建索引。正如第 3 章中所讨论的,如果您想要添加、删除或更新索引中的文档,Lucene 会写出新的段文件,并在后台异步合并和压缩段文件。我们将在第 11 章中看到更多关于这种增量处理的内容。

Alternatively, it is possible to build indexes incrementally. As discussed in Chapter 3, if you want to add, remove, or update documents in an index, Lucene writes out new segment files and asynchronously merges and compacts segment files in the background. We will see more on such incremental processing in Chapter 11.
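
可以把这种增量方式粗略地示意如下(一个假设性的简化:每个不可变的“段”只是一个从术语到倒排列表的映射,合并时将各段的倒排列表拼接并排序;真实的 Lucene 段是排序的磁盘结构):

The incremental approach can be roughly sketched as follows (a hypothetical simplification: each immutable "segment" is just a map from terms to postings lists, and merging concatenates and sorts the postings from each segment; real Lucene segments are sorted on-disk structures):

```python
# Hypothetical sketch: compacting several immutable index "segments"
# into one. Each segment here is just a dict of term -> sorted postings.

def merge_segments(segments):
    """Merge several segments into one compacted segment."""
    merged = {}
    for segment in segments:
        for term, postings in segment.items():
            merged.setdefault(term, []).extend(postings)
    for postings in merged.values():
        postings.sort()
    return merged

seg1 = {"quick": [1], "dog": [2]}
seg2 = {"quick": [3], "dog": [3]}   # later segment with newer documents
compacted = merge_segments([seg1, seg2])
print(compacted["quick"])  # [1, 3]
```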

键值存储为批处理输出

Key-value stores as batch process output

搜索索引只是批处理工作流程可能输出的示例之一。批处理的另一个常见用途是构建机器学习系统,例如分类器(例如,垃圾邮件过滤器、异常检测、图像识别)和推荐系统(例如,您可能认识的人、您可能感兴趣的产品或相关搜索 [ 29 ])。

Search indexes are just one example of the possible outputs of a batch processing workflow. Another common use for batch processing is to build machine learning systems such as classifiers (e.g., spam filters, anomaly detection, image recognition) and recommendation systems (e.g., people you may know, products you may be interested in, or related searches [29]).

这些批处理作业的输出通常是某种数据库:例如,可以通过用户 ID 查询以获得该用户建议的朋友的数据库,或者可以通过产品 ID 查询以获得相关产品列表的数据库[ 45 ]。

The output of those batch jobs is often some kind of database: for example, a database that can be queried by user ID to obtain suggested friends for that user, or a database that can be queried by product ID to get a list of related products [45].

这些数据库需要从处理用户请求的 Web 应用程序中查询,该应用程序通常与 Hadoop 基础设施分离。那么批处理过程的输出如何返回到 Web 应用程序可以查询的数据库中呢?

These databases need to be queried from the web application that handles user requests, which is usually separate from the Hadoop infrastructure. So how does the output from the batch process get back into a database where the web application can query it?

最明显的选择可能是直接在映射器或化简器中使用您最喜欢的数据库的客户端库,从批处理作业中直接向数据库服务器写入,一次一条记录。这可行(假设您的防火墙规则允许从 Hadoop 环境直接访问生产数据库),但由于以下几个原因,这是一个坏主意:

The most obvious choice might be to use the client library for your favorite database directly within a mapper or reducer, and to write from the batch job directly to the database server, one record at a time. This will work (assuming your firewall rules allow direct access from your Hadoop environment to your production databases), but it is a bad idea for several reasons:

  • 正如前面在连接上下文中所讨论的,对每条记录发出网络请求比批处理任务的正常吞吐量要慢几个数量级。即使客户端库支持批处理,性能也可能很差。

  • As discussed previously in the context of joins, making a network request for every single record is orders of magnitude slower than the normal throughput of a batch task. Even if the client library supports batching, performance is likely to be poor.

  • MapReduce 作业通常并行运行许多任务。如果所有映射器或化简器以批处理的预期速率同时写入同一输出数据库,则该数据库很容易被压垮,其查询性能可能会受到影响。这反过来又会导致系统其他部分出现运维问题 [ 35 ]。

  • MapReduce jobs often run many tasks in parallel. If all the mappers or reducers concurrently write to the same output database, with a rate expected of a batch process, that database can easily be overwhelmed, and its performance for queries is likely to suffer. This can in turn cause operational problems in other parts of the system [35].

  • 通常,MapReduce 为作业输出提供干净的“全有或全无”保证:如果作业成功,则结果是每个任务恰好运行一次的输出,即使某些任务失败并必须一路重试;如果整个作业失败,则不会产生任何输出。然而,从作业内部写入外部系统会产生外部可见的副作用,而这些副作用无法以这种方式隐藏。因此,您必须担心部分完成的作业的结果对其他系统可见,以及 Hadoop 任务尝试和推测执行的复杂性。

  • Normally, MapReduce provides a clean all-or-nothing guarantee for job output: if a job succeeds, the result is the output of running every task exactly once, even if some tasks failed and had to be retried along the way; if the entire job fails, no output is produced. However, writing to an external system from inside a job produces externally visible side effects that cannot be hidden in this way. Thus, you have to worry about the results from partially completed jobs being visible to other systems, and the complexities of Hadoop task attempts and speculative execution.

更好的解决方案是在批处理作业中构建一个全新的数据库,并将其作为文件写入分布式文件系统中作业的输出目录,就像上一节中的搜索索引一样。这些数据文件一旦写入就不可变,并且可以批量加载到处理只读查询的服务器中。各种键值存储支持在 MapReduce 作业中构建数据库文件,包括 Voldemort [ 46 ]、Terrapin [ 47 ]、ElephantDB [ 48 ] 和 HBase 批量加载 [ 49 ]。

A much better solution is to build a brand-new database inside the batch job and write it as files to the job’s output directory in the distributed filesystem, just like the search indexes in the last section. Those data files are then immutable once written, and can be loaded in bulk into servers that handle read-only queries. Various key-value stores support building database files in MapReduce jobs, including Voldemort [46], Terrapin [47], ElephantDB [48], and HBase bulk loading [49].

构建这些数据库文件是 MapReduce 的一个很好的用途:使用映射器提取键,然后按该键排序已经是构建索引所需的大量工作。由于大多数这些键值存储都是只读的(文件只能由批处理作业写入一次,然后是不可变的),因此数据结构非常简单。例如,它们不需要 WAL(请参阅 “使 B 树可靠”)。

Building these database files is a good use of MapReduce: using a mapper to extract a key and then sorting by that key is already a lot of the work required to build an index. Since most of these key-value stores are read-only (the files can only be written once by a batch job and are then immutable), the data structures are quite simple. For example, they do not require a WAL (see “Making B-trees reliable”).
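
作为示意(一个假设性的草图,并不代表上述任何系统的真实文件格式),一个只读键值存储可以简单到只是按键排序的记录数组,用二分查找即可完成查询,不需要 WAL 或任何可变数据结构:

As a sketch (a hypothetical toy, not the actual file format of any of the systems above), a read-only key-value store can be as simple as an array of records sorted by key, queried with binary search, with no need for a WAL or any mutable data structures:

```python
# Hypothetical sketch of a read-only key-value "store file":
# records sorted by key, queried by binary search. Because the store is
# written once by a batch job and never modified, no WAL is needed.
import bisect

def build_store(records):
    """records: iterable of (key, value). Returns a sorted, immutable store."""
    entries = sorted(records)          # the batch job's sort does this work
    keys = [k for k, _ in entries]
    values = [v for _, v in entries]
    return keys, values

def lookup(store, key):
    keys, values = store
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

store = build_store([("user:2", "bob"), ("user:1", "alice")])
print(lookup(store, "user:1"))  # alice
```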

将数据加载到 Voldemort 时,服务器继续向旧数据文件提供请求,同时将新数据文件从分布式文件系统复制到服务器的本地磁盘。复制完成后,服务器自动切换到查询新文件。如果在此过程中出现任何问题,它可以轻松地再次切换回旧文件,因为它们仍然存在并且不可变[ 46 ]。

When loading data into Voldemort, the server continues serving requests to the old data files while the new data files are copied from the distributed filesystem to the server’s local disk. Once the copying is complete, the server atomically switches over to querying the new files. If anything goes wrong in this process, it can easily switch back to the old files again, since they are still there and immutable [46].
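
这种“批量加载后原子切换”的模式可以用符号链接的原子替换来示意(一个假设性的草图,并非 Voldemort 的实际实现;假定 POSIX 文件系统):

This "bulk load, then atomically switch" pattern can be sketched with an atomic symlink swap (a hypothetical sketch, not Voldemort's actual implementation; assumes a POSIX filesystem):

```python
# Hypothetical sketch of the "bulk load, then atomically switch" pattern.
# A "current" symlink points at the active version directory; switching
# versions is one atomic rename, and rollback just re-points the link.
import os, tempfile

def publish_version(base, version):
    """Atomically point base/current at base/<version>."""
    tmp = os.path.join(base, "current.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(os.path.join(base, version), tmp)
    os.replace(tmp, os.path.join(base, "current"))  # atomic rename on POSIX

base = tempfile.mkdtemp()
for version, content in [("v1", "old"), ("v2", "new")]:
    os.mkdir(os.path.join(base, version))
    with open(os.path.join(base, version, "data"), "w") as f:
        f.write(content)

publish_version(base, "v1")
publish_version(base, "v2")             # switch over to the new files
with open(os.path.join(base, "current", "data")) as f:
    print(f.read())                     # new

publish_version(base, "v1")             # rollback: old files still exist
```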

批处理输出的哲学

Philosophy of batch process outputs

我们在本章前面讨论的 Unix 哲学(“Unix 哲学”)通过非常明确地了解数据流来鼓励实验:程序读取其输入并写入其输出。在此过程中,输入保持不变,任何先前的输出都将完全替换为新的输出,并且没有其他副作用。这意味着您可以根据需要多次重新运行命令、调整或调试它,而不会弄乱系统的状态。

The Unix philosophy that we discussed earlier in this chapter (“The Unix Philosophy”) encourages experimentation by being very explicit about dataflow: a program reads its input and writes its output. In the process, the input is left unchanged, any previous output is completely replaced with the new output, and there are no other side effects. This means that you can rerun a command as often as you like, tweaking or debugging it, without messing up the state of your system.

MapReduce 作业的输出处理遵循相同的理念。通过将输入视为不可变并避免副作用(例如写入外部数据库),批处理作业不仅可以获得良好的性能,而且变得更容易维护:

The handling of output from MapReduce jobs follows the same philosophy. By treating inputs as immutable and avoiding side effects (such as writing to external databases), batch jobs not only achieve good performance but also become much easier to maintain:

  • 如果您在代码中引入错误并且输出错误或损坏,您只需回滚到代码的先前版本并重新运行作业,输出将再次正确。或者,更简单的是,您可以将旧输出保留在不同的目录中,然后简单地切换回它。具有读写事务的数据库不具有此属性:如果您部署有错误的代码,将错误的数据写入数据库,则回滚代码将无法修复数据库中的数据。 (能够从有错误的代码中恢复的想法被称为人为容错 [ 50 ]。)

  • If you introduce a bug into the code and the output is wrong or corrupted, you can simply roll back to a previous version of the code and rerun the job, and the output will be correct again. Or, even simpler, you can keep the old output in a different directory and simply switch back to it. Databases with read-write transactions do not have this property: if you deploy buggy code that writes bad data to the database, then rolling back the code will do nothing to fix the data in the database. (The idea of being able to recover from buggy code has been called human fault tolerance [50].)

  • 由于易于回滚,功能开发可以比在错误可能意味着不可逆转的损害的环境中进行得更快。这种最小化不可逆性的原则有利于敏捷软件开发[ 51 ]。

  • As a consequence of this ease of rolling back, feature development can proceed more quickly than in an environment where mistakes could mean irreversible damage. This principle of minimizing irreversibility is beneficial for Agile software development [51].

  • 如果映射或化简任务失败,MapReduce 框架会自动重新调度它并在相同的输入上再次运行它。如果失败是由于代码中的错误导致的,那么它会不断崩溃,并最终导致作业在尝试几次后失败;但如果故障是由于暂时性问题造成的,则该故障是可以容忍的。这种自动重试之所以安全,是因为输入是不可变的,并且失败任务的输出会被 MapReduce 框架丢弃。

  • If a map or reduce task fails, the MapReduce framework automatically re-schedules it and runs it again on the same input. If the failure is due to a bug in the code, it will keep crashing and eventually cause the job to fail after a few attempts; but if the failure is due to a transient issue, the fault is tolerated. This automatic retry is only safe because inputs are immutable and outputs from failed tasks are discarded by the MapReduce framework.

  • 同一组文件可以用作各种不同作业的输入,包括计算指标并评估作业的输出是否具有预期特征的监视作业(例如,通过将其与先前运行的输出进行比较并测量差异)。

  • The same set of files can be used as input for various different jobs, including monitoring jobs that calculate metrics and evaluate whether a job’s output has the expected characteristics (for example, by comparing it to the output from the previous run and measuring discrepancies).

  • 与 Unix 工具一样,MapReduce 作业将逻辑与连接(配置输入和输出目录)分开,这提供了关注点分离并实现了代码的潜在重用:一个团队可以专注于实现一项擅长完成一件事的作业,而其他团队可以决定何时何地运行该作业。

  • Like Unix tools, MapReduce jobs separate logic from wiring (configuring the input and output directories), which provides a separation of concerns and enables potential reuse of code: one team can focus on implementing a job that does one thing well, while other teams can decide where and when to run that job.

在这些领域,适用于 Unix 的设计原则似乎也适用于 Hadoop,但 Unix 和 Hadoop 在某些方面也有所不同。例如,因为大多数 Unix 工具都假定无类型的文本文件,所以它们必须进行大量的输入解析(本章开头的日志分析示例使用 {print $7} 来提取 URL)。在 Hadoop 上,通过使用更结构化的文件格式可以消除一些低价值的语法转换:经常使用 Avro(请参阅“Avro”)和 Parquet(请参阅“面向列的存储”),因为它们提供高效的基于模式的编码,并允许其模式随着时间的推移而演变(参见第 4 章)。

In these areas, the design principles that worked well for Unix also seem to be working well for Hadoop—but Unix and Hadoop also differ in some ways. For example, because most Unix tools assume untyped text files, they have to do a lot of input parsing (our log analysis example at the beginning of the chapter used {print $7} to extract the URL). On Hadoop, some of those low-value syntactic conversions are eliminated by using more structured file formats: Avro (see “Avro”) and Parquet (see “Column-Oriented Storage”) are often used, as they provide efficient schema-based encoding and allow evolution of their schemas over time (see Chapter 4).

Hadoop 与分布式数据库的比较

Comparing Hadoop to Distributed Databases

正如我们所看到的,Hadoop 有点像 Unix 的分布式版本,其中 HDFS 是文件系统,而 MapReduce 是 Unix 进程的一种奇特实现(它恰好总是在映射阶段和化简阶段之间运行 sort 实用程序)。我们了解了如何在这些原语之上实现各种连接和分组操作。

As we have seen, Hadoop is somewhat like a distributed version of Unix, where HDFS is the filesystem and MapReduce is a quirky implementation of a Unix process (which happens to always run the sort utility between the map phase and the reduce phase). We saw how you can implement various join and grouping operations on top of these primitives.

当 MapReduce 论文 [ 1 ] 发表时,从某种意义上说,它根本就不是什么新鲜事。我们在过去几节中讨论的所有处理和并行连接算法已经在十多年前在所谓的 大规模并行处理(MPP) 数据库中实现了 [ 3 , 40 ]。例如,Gamma 数据库机、Teradata 和 Tandem NonStop SQL 是该领域的先驱[ 52 ]。

When the MapReduce paper [1] was published, it was—in some sense—not at all new. All of the processing and parallel join algorithms that we discussed in the last few sections had already been implemented in so-called massively parallel processing (MPP) databases more than a decade previously [3, 40]. For example, the Gamma database machine, Teradata, and Tandem NonStop SQL were pioneers in this area [52].

最大的区别是 MPP 数据库专注于在机器集群上并行执行分析 SQL 查询,而 MapReduce 和分布式文件系统 [ 19 ] 的结合提供了更像可以运行任意程序的通用操作系统的东西。

The biggest difference is that MPP databases focus on parallel execution of analytic SQL queries on a cluster of machines, while the combination of MapReduce and a distributed filesystem [19] provides something much more like a general-purpose operating system that can run arbitrary programs.

存储多样性

Diversity of storage

数据库要求您根据特定模型(例如关系模型或文档模型)构建数据,而分布式文件系统中的文件只是字节序列,可以使用任何数据模型和编码来编写。它们可能是数据库记录的集合,但它们同样可以是文本、图像、视频、传感器读数、稀疏矩阵、特征向量、基因组序列或任何其他类型的数据。

Databases require you to structure data according to a particular model (e.g., relational or documents), whereas files in a distributed filesystem are just byte sequences, which can be written using any data model and encoding. They might be collections of database records, but they can equally well be text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences, or any other kind of data.

坦率地说,Hadoop 开启了将数据不加区别地转储到 HDFS 中的可能性,直到后来才弄清楚如何进一步处理它[ 53 ]。相比之下,MPP 数据库通常需要在将数据导入数据库的专有存储格式之前对数据和查询模式进行仔细的预先建模。

To put it bluntly, Hadoop opened up the possibility of indiscriminately dumping data into HDFS, and only later figuring out how to process it further [53]. By contrast, MPP databases typically require careful up-front modeling of the data and query patterns before importing the data into the database’s proprietary storage format.

从纯粹主义者的角度来看,这种仔细的建模和导入似乎是可取的,因为这意味着数据库的用户可以使用更高质量的数据。然而,在实践中,似乎简单地快速提供数据(即使它是一种古怪的、难以使用的原始格式)通常比尝试预先决定理想的数据模型更有价值 [ 54 ]。

From a purist’s point of view, it may seem that this careful modeling and import is desirable, because it means users of the database have better-quality data to work with. However, in practice, it appears that simply making data available quickly—even if it is in a quirky, difficult-to-use, raw format—is often more valuable than trying to decide on the ideal data model up front [54].

这个想法类似于数据仓库(请参阅“数据仓库”):简单地将来自大型组织各个部分的数据集中到一个地方是有价值的,因为它可以跨以前不同的数据集进行连接。MPP 数据库所需的仔细模式设计会减慢集中数据收集的速度;以原始形式收集数据,然后再考虑模式设计,可以加快数据收集速度(这个概念有时称为“数据湖”或“企业数据中心” [ 55 ])。

The idea is similar to a data warehouse (see “Data Warehousing”): simply bringing data from various parts of a large organization together in one place is valuable, because it enables joins across datasets that were previously disparate. The careful schema design required by an MPP database slows down that centralized data collection; collecting data in its raw form, and worrying about schema design later, allows the data collection to be speeded up (a concept sometimes known as a “data lake” or “enterprise data hub” [55]).

不加区别的数据转储转移了解释数据的负担:数据的解释变成了消费者的问题,而不是强迫数据集的生产者将其转换为标准化格式(读取时模式方法 [ 56 ];参见“文档模型中的架构灵活性”)。如果生产者和消费者是具有不同优先级的不同团队,这可能是一个优势。甚至可能不存在一种理想的数据模型,而是存在适用于不同目的的不同数据视图。简单地以原始形式转储数据就允许进行多次此类转换。这种方法被称为寿司原则:“原始数据更好” [ 57 ]。

Indiscriminate data dumping shifts the burden of interpreting the data: instead of forcing the producer of a dataset to bring it into a standardized format, the interpretation of the data becomes the consumer’s problem (the schema-on-read approach [56]; see “Schema flexibility in the document model”). This can be an advantage if the producer and consumers are different teams with different priorities. There may not even be one ideal data model, but rather different views onto the data that are suitable for different purposes. Simply dumping data in its raw form allows for several such transformations. This approach has been dubbed the sushi principle: “raw data is better” [57].

因此,Hadoop 经常用于实现 ETL 流程(请参阅“数据仓库”):来自事务处理系统的数据以某种原始形式转储到分布式文件系统中,然后编写 MapReduce 作业来清理该数据,将其转换为关系形式,并将其导入 MPP 数据仓库以进行分析。数据建模仍然会发生,但它是在一个单独的步骤中,与数据收集分离。这种解耦是可能的,因为分布式文件系统支持以任何格式编码的数据。

Thus, Hadoop has often been used for implementing ETL processes (see “Data Warehousing”): data from transaction processing systems is dumped into the distributed filesystem in some raw form, and then MapReduce jobs are written to clean up that data, transform it into a relational form, and import it into an MPP data warehouse for analytic purposes. Data modeling still happens, but it is in a separate step, decoupled from the data collection. This decoupling is possible because a distributed filesystem supports data encoded in any format.

处理模型的多样性

Diversity of processing models

MPP 数据库是整体的、紧密集成的软件,负责磁盘上的存储布局、查询规划、调度和执行。由于这些组件都可以针对数据库的特定需求进行调整和优化,因此系统作为一个整体可以在其设计的查询类型上实现非常好的性能。此外,SQL 查询语言允许表达性查询和优雅的语义,而无需编写代码,使得业务分析师使用的图形工具(例如 Tableau)可以访问它。

MPP databases are monolithic, tightly integrated pieces of software that take care of storage layout on disk, query planning, scheduling, and execution. Since these components can all be tuned and optimized for the specific needs of the database, the system as a whole can achieve very good performance on the types of queries for which it is designed. Moreover, the SQL query language allows expressive queries and elegant semantics without the need to write code, making it accessible to graphical tools used by business analysts (such as Tableau).

另一方面,并非所有类型的处理都可以合理地表达为 SQL 查询。例如,如果您正在构建机器学习和推荐系统,或具有相关性排名模型的全文搜索索引,或执行图像分析,您很可能需要更通用的数据处理模型。这些类型的处理通常针对特定应用程序(例如,机器学习的特征工程、机器翻译的自然语言模型、欺诈预测的风险估计函数),因此它们不可避免地需要编写代码,而不仅仅是查询。

On the other hand, not all kinds of processing can be sensibly expressed as SQL queries. For example, if you are building machine learning and recommendation systems, or full-text search indexes with relevance ranking models, or performing image analysis, you most likely need a more general model of data processing. These kinds of processing are often very specific to a particular application (e.g., feature engineering for machine learning, natural language models for machine translation, risk estimation functions for fraud prediction), so they inevitably require writing code, not just queries.

MapReduce 使工程师能够轻松地在大型数据集上运行自己的代码。如果你有 HDFS 和 MapReduce,你可以在其上构建一个 SQL 查询执行引擎,事实上这就是 Hive 项目所做的 [ 31 ]。但是,您还可以编写许多其他形式的批处理过程,这些处理过程不适合表达为 SQL 查询。

MapReduce gave engineers the ability to easily run their own code over large datasets. If you have HDFS and MapReduce, you can build a SQL query execution engine on top of it, and indeed this is what the Hive project did [31]. However, you can also write many other forms of batch processes that do not lend themselves to being expressed as a SQL query.

随后,人们发现 MapReduce 的局限性太大,对于某些类型的处理来说性能太差,因此在 Hadoop 之上开发了各种其他处理模型(我们将在“超越 MapReduce”中看到其中一些模型)。仅有 SQL 和 MapReduce 这两种处理模型是不够的:还需要更多不同的模型!由于 Hadoop 平台的开放性,实现一系列方法是可行的,而这在单一 MPP 数据库的范围内是不可能实现的 [ 58 ]。

Subsequently, people found that MapReduce was too limiting and performed too badly for some types of processing, so various other processing models were developed on top of Hadoop (we will see some of them in “Beyond MapReduce”). Having two processing models, SQL and MapReduce, was not enough: even more different models were needed! And due to the openness of the Hadoop platform, it was feasible to implement a whole range of approaches, which would not have been possible within the confines of a monolithic MPP database [58].

至关重要的是,这些不同的处理模型都可以在单个共享计算机集群上运行,所有模型都访问分布式文件系统上的相同文件。在 Hadoop 方法中,无需将数据导入多个不同的专用系统以进行不同类型的处理:系统足够灵活,可以支持同一集群内的不同工作负载。不必移动数据可以更轻松地从数据中获取价值,并且更容易尝试新的处理模型。

Crucially, those various processing models can all be run on a single shared-use cluster of machines, all accessing the same files on the distributed filesystem. In the Hadoop approach, there is no need to import the data into several different specialized systems for different kinds of processing: the system is flexible enough to support a diverse set of workloads within the same cluster. Not having to move data around makes it a lot easier to derive value from the data, and a lot easier to experiment with new processing models.

Hadoop 生态系统包括随机访问的 OLTP 数据库,例如 HBase(请参阅“SSTables 和 LSM-Trees”),以及 MPP 型分析数据库,例如 Impala [ 41 ]。HBase 和 Impala 都不使用 MapReduce,但都使用 HDFS 进行存储。它们是访问和处理数据的两种截然不同的方法,但它们仍然可以共存并集成在同一系统中。

The Hadoop ecosystem includes both random-access OLTP databases such as HBase (see “SSTables and LSM-Trees”) and MPP-style analytic databases such as Impala [41]. Neither HBase nor Impala uses MapReduce, but both use HDFS for storage. They are very different approaches to accessing and processing data, but they can nevertheless coexist and be integrated in the same system.

针对频繁故障进行设计

Designing for frequent faults

将 MapReduce 与 MPP 数据库进行比较时,设计方法上还有两个差异很突出:故障处理以及内存和磁盘的使用。与在线系统相比,批处理对故障的敏感度较低,因为如果发生故障,它们不会立即影响用户,并且始终可以再次运行。

When comparing MapReduce to MPP databases, two more differences in design approach stand out: the handling of faults and the use of memory and disk. Batch processes are less sensitive to faults than online systems, because they do not immediately affect users if they fail and they can always be run again.

如果执行查询时节点崩溃,大多数 MPP 数据库会中止整个查询,并让用户重新提交查询或自动再次运行它 [ 3 ]。由于查询通常只运行几秒钟或最多几分钟,这种处理错误的方式是可以接受的,因为重试的成本不会太大。MPP 数据库还倾向于将尽可能多的数据保留在内存中(例如,使用散列连接),以避免从磁盘读取的成本。

If a node crashes while a query is executing, most MPP databases abort the entire query, and either let the user resubmit the query or automatically run it again [3]. As queries normally run for a few seconds or a few minutes at most, this way of handling errors is acceptable, since the cost of retrying is not too great. MPP databases also prefer to keep as much data as possible in memory (e.g., using hash joins) to avoid the cost of reading from disk.

另一方面,MapReduce 可以通过在单个任务的粒度上重试工作来容忍映射或化简任务的失败,而不影响整个作业。它还非常渴望将数据写入磁盘,部分是为了容错,部分是假设数据集太大而无法放入内存。

On the other hand, MapReduce can tolerate the failure of a map or reduce task without it affecting the job as a whole by retrying work at the granularity of an individual task. It is also very eager to write data to disk, partly for fault tolerance, and partly on the assumption that the dataset will be too big to fit in memory anyway.

MapReduce 方法更适合较大的作业:处理大量数据并运行很长时间的作业,以至于它们在此过程中可能会遇到至少一个任务失败。在这种情况下,由于单个任务失败而重新运行整个作业将是浪费的。即使以单个任务的粒度进行恢复会带来开销,导致无故障处理速度变慢,但如果任务失败率足够高,这仍然是一个合理的权衡。

The MapReduce approach is more appropriate for larger jobs: jobs that process so much data and run for such a long time that they are likely to experience at least one task failure along the way. In that case, rerunning the entire job due to a single task failure would be wasteful. Even if recovery at the granularity of an individual task introduces overheads that make fault-free processing slower, it can still be a reasonable trade-off if the rate of task failures is high enough.

但这些假设有多现实呢?在大多数集群中,机器故障确实会发生,但并不是很频繁——可能很少见,以至于大多数作业都不会遇到机器故障。为了容错而付出巨大的开销真的值得吗?

But how realistic are these assumptions? In most clusters, machine failures do occur, but they are not very frequent—probably rare enough that most jobs will not experience a machine failure. Is it really worth incurring significant overheads for the sake of fault tolerance?

要理解 MapReduce 为何节约使用内存并在任务粒度上进行恢复,了解 MapReduce 最初设计时所处的环境会有所帮助。谷歌拥有混合用途数据中心,其中在线生产服务和离线批处理作业在同一台机器上运行。每个任务都有一个使用容器强制执行的资源分配(CPU 核心、RAM、磁盘空间等)。每个任务也有一个优先级,如果较高优先级的任务需要更多的资源,则可以终止(抢占)同一台机器上的较低优先级的任务,以释放资源。优先级还决定了计算资源的定价:团队必须为他们使用的资源付费,优先级较高的进程成本更高 [ 59 ]。

To understand the reasons for MapReduce’s sparing use of memory and task-level recovery, it is helpful to look at the environment for which MapReduce was originally designed. Google has mixed-use datacenters, in which online production services and offline batch jobs run on the same machines. Every task has a resource allocation (CPU cores, RAM, disk space, etc.) that is enforced using containers. Every task also has a priority, and if a higher-priority task needs more resources, lower-priority tasks on the same machine can be terminated (preempted) in order to free up resources. Priority also determines pricing of the computing resources: teams must pay for the resources they use, and higher-priority processes cost more [59].

这种架构允许过度使用非生产(低优先级)计算资源,因为系统知道它可以在必要时回收资源。与分离生产和非生产任务的系统相比,过度使用资源反过来可以更好地利用机器并提高效率。然而,由于 MapReduce 作业以低优先级运行,因此它们存在随时被抢占的风险,因为更高优先级的进程需要它们的资源。批处理作业有效地“捡起桌子底下的碎片”,使用高优先级进程获取所需资源后剩余的任何计算资源。

This architecture allows non-production (low-priority) computing resources to be overcommitted, because the system knows that it can reclaim the resources if necessary. Overcommitting resources in turn allows better utilization of machines and greater efficiency compared to systems that segregate production and non-production tasks. However, as MapReduce jobs run at low priority, they run the risk of being preempted at any time because a higher-priority process requires their resources. Batch jobs effectively “pick up the scraps under the table,” using any computing resources that remain after the high-priority processes have taken what they need.

在 Google,运行一小时的 MapReduce 任务有大约 5% 的风险被终止,以便为更高优先级的进程腾出空间。这个比率比由于硬件问题、机器重启或其他原因导致的故障率高出一个数量级以上[ 59 ]。按照这种抢占率,如果一个作业有 100 个任务,每个任务运行 10 分钟,则至少有一个任务在完成之前被终止的风险大于 50%。

At Google, a MapReduce task that runs for an hour has an approximately 5% risk of being terminated to make space for a higher-priority process. This rate is more than an order of magnitude higher than the rate of failures due to hardware issues, machine reboot, or other reasons [59]. At this rate of preemptions, if a job has 100 tasks that each run for 10 minutes, there is a risk greater than 50% that at least one task will be terminated before it is finished.
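
这个数字可以粗略验算如下(假设任务被终止的事件以恒定速率独立发生):

That figure can be checked with a back-of-the-envelope calculation (assuming terminations occur independently at a constant rate):

```python
# Back-of-the-envelope check of the preemption numbers in the text,
# assuming terminations arrive independently at a constant rate.
hourly_kill_risk = 0.05                     # ~5% per task-hour [59]
survive_10min = (1 - hourly_kill_risk) ** (10 / 60)
all_100_survive = survive_10min ** 100      # 100 independent 10-minute tasks
at_least_one_killed = 1 - all_100_survive
print(round(at_least_one_killed, 2))        # about 0.57, i.e. > 50%
```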

这就是为什么 MapReduce 被设计为能够容忍频繁的意外任务终止:这并不是因为硬件特别不可靠,而是因为任意终止进程的自由可以在计算集群中更好地利用资源。

And this is why MapReduce is designed to tolerate frequent unexpected task termination: it’s not because the hardware is particularly unreliable, it’s because the freedom to arbitrarily terminate processes enables better resource utilization in a computing cluster.

在开源集群调度器中,抢占的使用较少。YARN 的 CapacityScheduler 支持通过抢占来平衡不同队列的资源分配 [ 58 ],但在撰写本文时,YARN、Mesos 或 Kubernetes 均不支持通用的优先级抢占 [ 60 ]。在任务不经常被终止的环境中,MapReduce 的设计决策就没那么有意义了。在下一节中,我们将研究 MapReduce 的一些替代方案,它们做出了不同的设计决策。

Among open source cluster schedulers, preemption is less widely used. YARN’s CapacityScheduler supports preemption for balancing the resource allocation of different queues [58], but general priority preemption is not supported in YARN, Mesos, or Kubernetes at the time of writing [60]. In an environment where tasks are not so often terminated, the design decisions of MapReduce make less sense. In the next section, we will look at some alternatives to MapReduce that make different design decisions.

超越 MapReduce

Beyond MapReduce

尽管 MapReduce 在 2000 年代末变得非常流行并受到广泛宣传,但它只是分布式系统的众多可能的编程模型之一。根据数据量、数据结构以及数据处理的类型,其他工具可能更适合表达计算。

Although MapReduce became very popular and received a lot of hype in the late 2000s, it is just one among many possible programming models for distributed systems. Depending on the volume of data, the structure of the data, and the type of processing being done with it, other tools may be more appropriate for expressing a computation.

尽管如此,我们在本章中花了很多时间讨论 MapReduce,因为它是一个有用的学习工具,因为它是分布式文件系统之上的相当清晰和简单的抽象。也就是说,简单是指能够理解它在做什么,而不是易于使用。恰恰相反:使用原始 MapReduce API 实现复杂的处理作业实际上非常困难且费力 — 例如,您需要从头开始实现任何连接算法 [ 37 ]。

We nevertheless spent a lot of time in this chapter discussing MapReduce because it is a useful learning tool, as it is a fairly clear and simple abstraction on top of a distributed filesystem. That is, simple in the sense of being able to understand what it is doing, not in the sense of being easy to use. Quite the opposite: implementing a complex processing job using the raw MapReduce APIs is actually quite hard and laborious—for instance, you would need to implement any join algorithms from scratch [37].

为了解决直接使用 MapReduce 的困难,各种高级编程模型(Pig、Hive、Cascading、Crunch)被创建为 MapReduce 之上的抽象。如果您了解 MapReduce 的工作原理,那么它们相当容易学习,并且它们的高级构造使许多常见的批处理任务更容易实现。

In response to the difficulty of using MapReduce directly, various higher-level programming models (Pig, Hive, Cascading, Crunch) were created as abstractions on top of MapReduce. If you understand how MapReduce works, they are fairly easy to learn, and their higher-level constructs make many common batch processing tasks significantly easier to implement.

然而,MapReduce 执行模型本身也存在问题,这些问题无法通过添加另一个抽象级别来解决,并且表现为某些处理的性能较差。一方面,MapReduce 非常健壮:您可以使用它在任务频繁终止的不可靠多租户系统上处理几乎任意大量的数据,并且它仍然可以完成工作(尽管速度很慢)。另一方面,对于某些类型的处理,其他工具有时要快几个数量级。

However, there are also problems with the MapReduce execution model itself, which are not fixed by adding another level of abstraction and which manifest themselves as poor performance for some kinds of processing. On the one hand, MapReduce is very robust: you can use it to process almost arbitrarily large quantities of data on an unreliable multi-tenant system with frequent task terminations, and it will still get the job done (albeit slowly). On the other hand, other tools are sometimes orders of magnitude faster for some kinds of processing.

在本章的其余部分中,我们将研究其中一些批处理的替代方案。在 第11章中,我们将转向流处理,这可以被视为加速批处理的另一种方法。

In the rest of this chapter, we will look at some of those alternatives for batch processing. In Chapter 11 we will move to stream processing, which can be regarded as another way of speeding up batch processing.

中间状态的物化

Materialization of Intermediate State

如前所述,每个 MapReduce 作业都独立于其他作业。作业与外界的主要联系点是它在分布式文件系统上的输入和输出目录。如果您希望一个作业的输出成为第二个作业的输入,则需要将第二个作业的输入目录配置为与第一个作业的输出目录相同,并且外部工作流调度程序必须等第一个作业完成后才能启动第二个作业。

As discussed previously, every MapReduce job is independent from every other job. The main contact points of a job with the rest of the world are its input and output directories on the distributed filesystem. If you want the output of one job to become the input to a second job, you need to configure the second job’s input directory to be the same as the first job’s output directory, and an external workflow scheduler must start the second job only once the first job has completed.

如果第一个作业的输出是您想要在组织内广泛发布的数据集,则此设置是合理的。在这种情况下,您需要能够按名称引用它,并将其重用为多个不同作业(包括其他团队开发的作业)的输入。将数据发布到分布式文件系统中众所周知的位置可以实现松散耦合,这样作业就不需要知道谁在产生其输入或消耗其输出(请参阅“逻辑与接线的分离”)。

This setup is reasonable if the output from the first job is a dataset that you want to publish widely within your organization. In that case, you need to be able to refer to it by name and reuse it as input to several different jobs (including jobs developed by other teams). Publishing data to a well-known location in the distributed filesystem allows loose coupling so that jobs don’t need to know who is producing their input or consuming their output (see “Separation of logic and wiring”).

然而,在许多情况下,您知道一项作业的输出仅用作另一项作业的输入,而另一项作业由同一团队维护。在这种情况下,分布式文件系统上的文件只是中间状态:一种将数据从一个作业传递到下一个作业的方法。在用于构建由 50 或 100 个 MapReduce 作业 [ 29 ] 组成的推荐系统的复杂工作流程中,存在大量此类中间状态。

However, in many cases, you know that the output of one job is only ever used as input to one other job, which is maintained by the same team. In this case, the files on the distributed filesystem are simply intermediate state: a means of passing data from one job to the next. In the complex workflows used to build recommendation systems consisting of 50 or 100 MapReduce jobs [29], there is a lot of such intermediate state.

将这种中间状态写入文件的过程称为物化。(我们之前在物化视图的上下文中遇到过这个术语,在 “聚合:数据立方体和物化视图”中。它意味着急切地计算某些操作的结果并将其写出来,而不是在请求时按需计算。)

The process of writing out this intermediate state to files is called materialization. (We came across the term previously in the context of materialized views, in “Aggregation: Data Cubes and Materialized Views”. It means to eagerly compute the result of some operation and write it out, rather than computing it on demand when requested.)

相比之下,本章开头的日志分析示例使用 Unix 管道将一个命令的输出与另一个命令的输入连接起来。管道不会完全实现中间状态,而是仅使用一个小的内存缓冲区将输出增量流式传输到输入。

By contrast, the log analysis example at the beginning of the chapter used Unix pipes to connect the output of one command with the input of another. Pipes do not fully materialize the intermediate state, but instead stream the output to the input incrementally, using only a small in-memory buffer.
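
这种区别可以用生成器粗略地类比(仅作示意):物化相当于先把整个中间结果构建成列表,而流式传递则让下一阶段在记录产生时立即消费它。

The distinction can be loosely illustrated with generators (illustrative only): materialization builds the entire intermediate result as a list before the next stage starts, whereas streaming lets the next stage consume each record as soon as it is produced.

```python
# Rough analogy (illustrative only): materialization vs. incremental
# streaming between two processing stages.

def parse(lines):
    for line in lines:            # stage 1: extract the first field
        yield line.split()[0]

def count_unique(tokens):         # stage 2: aggregate
    return len(set(tokens))

lines = ["a 1", "b 2", "a 3"]

# Materialized: the full intermediate result is built before stage 2 starts.
intermediate = list(parse(lines))
print(count_unique(intermediate))   # 2

# Streamed: stage 2 pulls records from stage 1 one at a time,
# like a Unix pipe with a small buffer.
print(count_unique(parse(lines)))   # 2
```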

与 Unix 管道相比,MapReduce 这种完全物化中间状态的方法有一些缺点:

MapReduce’s approach of fully materializing intermediate state has downsides compared to Unix pipes:

  • MapReduce 作业只有在前面的作业(生成其输入的作业)中的所有任务都完成后才能启动,而通过 Unix 管道连接的进程是同时启动的,输出一产生就会被消费。不同机器上的数据倾斜或负载差异,意味着一个作业常常会有少量“掉队”的任务,它们比其他任务耗时长得多。必须等待前面作业的所有任务全部完成,会拖慢整个工作流程的执行。

  • A MapReduce job can only start when all tasks in the preceding jobs (that generate its inputs) have completed, whereas processes connected by a Unix pipe are started at the same time, with output being consumed as soon as it is produced. Skew or varying load on different machines means that a job often has a few straggler tasks that take much longer to complete than the others. Having to wait until all of the preceding job’s tasks have completed slows down the execution of the workflow as a whole.

  • 映射器通常是冗余的:它们只是读回刚刚由归约器(reducer)写入的同一份文件,并为下一阶段的分区和排序做准备。在许多情况下,映射器的代码本可以作为前一个归约器的一部分:如果归约器的输出按照与映射器输出相同的方式进行分区和排序,那么归约器就可以直接串联起来,而无需与映射器阶段交错。

  • Mappers are often redundant: they just read back the same file that was just written by a reducer, and prepare it for the next stage of partitioning and sorting. In many cases, the mapper code could be part of the previous reducer: if the reducer output was partitioned and sorted in the same way as mapper output, then reducers could be chained together directly, without interleaving with mapper stages.

  • 在分布式文件系统中存储中间状态意味着这些文件会跨多个节点复制,这对于此类临时数据来说通常是过大的。

  • Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data.

数据流引擎

Dataflow engines

为了解决 MapReduce 的这些问题,开发了几种用于分布式批量计算的新执行引擎,其中最著名的是 Spark [ 61 , 62 ]、Tez [ 63 , 64 ] 和 Flink [ 65 , 66 ]。它们的设计方式存在各种差异,但它们有一个共同点:它们将整个工作流程作为一项作业来处理,而不是将其分解为独立的子作业。

In order to fix these problems with MapReduce, several new execution engines for distributed batch computations were developed, the most well known of which are Spark [61, 62], Tez [63, 64], and Flink [65, 66]. There are various differences in the way they are designed, but they have one thing in common: they handle an entire workflow as one job, rather than breaking it up into independent subjobs.

由于它们通过多个处理阶段对数据流进行显式建模,因此这些系统被称为数据流引擎。与 MapReduce 一样,它们的工作方式是重复调用用户定义的函数,在单个线程上一次处理一条记录。它们通过划分输入来并行工作,并通过网络复制一个函数的输出以成为另一个函数的输入。

Since they explicitly model the flow of data through several processing stages, these systems are known as dataflow engines. Like MapReduce, they work by repeatedly calling a user-defined function to process one record at a time on a single thread. They parallelize work by partitioning inputs, and they copy the output of one function over the network to become the input to another function.

与 MapReduce 不同,这些函数不需要扮演严格的映射和归约交替角色,而是可以以更灵活的方式组装。我们将这些函数称为“运算符”,数据流引擎提供了几种不同的选项来将一个运算符的输出连接到另一个运算符的输入:

Unlike in MapReduce, these functions need not take the strict roles of alternating map and reduce, but instead can be assembled in more flexible ways. We call these functions operators, and the dataflow engine provides several different options for connecting one operator’s output to another’s input:

  • 一种选择是按键重新分区和排序记录,就像在 MapReduce 的 shuffle 阶段一样(请参阅“MapReduce 的分布式执行”)。此功能支持排序合并连接和分组,其方式与 MapReduce 中相同。

  • One option is to repartition and sort records by key, like in the shuffle stage of MapReduce (see “Distributed execution of MapReduce”). This feature enables sort-merge joins and grouping in the same way as in MapReduce.

  • 另一种可能性是采用多个输入并以相同的方式对它们进行分区,但跳过排序。这节省了分区哈希连接的工作量,其中记录的分区很重要,但顺序无关紧要,因为构建哈希表无论如何都会随机化顺序。

  • Another possibility is to take several inputs and to partition them in the same way, but skip the sorting. This saves effort on partitioned hash joins, where the partitioning of records is important but the order is irrelevant because building the hash table randomizes the order anyway.

  • 对于广播哈希联接,一个运算符的相同输出可以发送到联接运算符的所有分区。

  • For broadcast hash joins, the same output from one operator can be sent to all partitions of the join operator.

这种类型的处理引擎基于 Dryad [ 67 ] 和 Nephele [ 68 ] 等研究系统,与 MapReduce 模型相比,它具有以下几个优点:

This style of processing engine is based on research systems like Dryad [67] and Nephele [68], and it offers several advantages compared to the MapReduce model:

  • 诸如排序之类的昂贵工作只需要在实际需要的地方执行,而不是默认总是在每个映射和化简阶段之间进行。

  • Expensive work such as sorting need only be performed in places where it is actually required, rather than always happening by default between every map and reduce stage.

  • 没有不必要的映射任务,因为映射器完成的工作通常可以并入前面的 reduce 运算符中(因为映射器不会改变数据集的分区方式)。

  • There are no unnecessary map tasks, since the work done by a mapper can often be incorporated into the preceding reduce operator (because a mapper does not change the partitioning of a dataset).

  • 由于工作流中的所有连接和数据依赖项都是显式声明的,因此调度程序可以了解哪里需要哪些数据,因此可以进行局部性优化。例如,它可以尝试将消耗某些数据的任务与生成数据的任务放在同一台机器上,以便可以通过共享内存缓冲区交换数据,而不必通过网络复制数据。

  • Because all joins and data dependencies in a workflow are explicitly declared, the scheduler has an overview of what data is required where, so it can make locality optimizations. For example, it can try to place the task that consumes some data on the same machine as the task that produces it, so that the data can be exchanged through a shared memory buffer rather than having to copy it over the network.

  • 通常,将运算符之间的中间状态保留在内存中或写入本地磁盘就足够了,这比将其写入 HDFS(必须将其复制到多台机器并写入每个副本上的磁盘)所需的 I/O 更少。MapReduce 已经将这种优化用于映射器输出,但数据流引擎将这一想法推广到所有中间状态。

  • It is usually sufficient for intermediate state between operators to be kept in memory or written to local disk, which requires less I/O than writing it to HDFS (where it must be replicated to several machines and written to disk on each replica). MapReduce already uses this optimization for mapper output, but dataflow engines generalize the idea to all intermediate state.

  • 算子的输入一准备好就可以开始执行;无需等待前一阶段全部完成,下一阶段即可开始。

  • Operators can start executing as soon as their input is ready; there is no need to wait for the entire preceding stage to finish before the next one starts.

  • 现有的 Java 虚拟机 (JVM) 进程可以重用来运行新的算子,与 MapReduce(为每个任务启动一个新的 JVM)相比,减少了启动开销。

  • Existing Java Virtual Machine (JVM) processes can be reused to run new operators, reducing startup overheads compared to MapReduce (which launches a new JVM for each task).

您可以使用数据流引擎来实现与 MapReduce 工作流相同的计算,并且由于此处描述的优化,它们的执行速度通常要快得多。由于操作符是映射和化简的泛化,因此相同的处理代码可以在任一执行引擎上运行:通过简单的配置更改,可以将在 Pig、Hive 或 Cascading 中实现的工作流从 MapReduce 切换到 Tez 或 Spark,而无需修改代码 [ 64 ]。

You can use dataflow engines to implement the same computations as MapReduce workflows, and they usually execute significantly faster due to the optimizations described here. Since operators are a generalization of map and reduce, the same processing code can run on either execution engine: workflows implemented in Pig, Hive, or Cascading can be switched from MapReduce to Tez or Spark with a simple configuration change, without modifying code [64].

Tez 是一个相当薄的库,依赖于 YARN shuffle 服务在节点之间实际复制数据 [ 58 ],而 Spark 和 Flink 是大型框架,包括自己的网络通信层、调度程序和面向用户的 API。我们将很快讨论这些高级 API。

Tez is a fairly thin library that relies on the YARN shuffle service for the actual copying of data between nodes [58], whereas Spark and Flink are big frameworks that include their own network communication layer, scheduler, and user-facing APIs. We will discuss those high-level APIs shortly.

容错能力

Fault tolerance

将中间状态完全实现到分布式文件系统的一个优点是它是持久的,这使得 MapReduce 中的容错变得相当容易:如果任务失败,它可以在另一台机器上重新启动,并从文件系统中再次读取相同的输入。

An advantage of fully materializing intermediate state to a distributed filesystem is that it is durable, which makes fault tolerance fairly easy in MapReduce: if a task fails, it can just be restarted on another machine and read the same input again from the filesystem.

Spark、Flink 和 Tez 避免将中间状态写入 HDFS,因此它们采用不同的容错方法:如果一台机器发生故障,其上的中间状态丢失,就会根据仍然可用的其他数据重新计算——如果可能,基于更早的中间阶段;否则基于原始输入数据(通常存放在 HDFS 上)。

Spark, Flink, and Tez avoid writing intermediate state to HDFS, so they take a different approach to tolerating faults: if a machine fails and the intermediate state on that machine is lost, it is recomputed from other data that is still available (a prior intermediary stage if possible, or otherwise the original input data, which is normally on HDFS).

为了实现这种重新计算,框架必须跟踪一段数据是如何计算出来的——它使用了哪些输入分区,以及对它应用了哪些算子。Spark 使用弹性分布式数据集(RDD)抽象来追踪数据的血统 [ 61 ],而 Flink 对算子状态做检查点,从而能够恢复在执行过程中出错的算子 [ 66 ]。

To enable this recomputation, the framework must keep track of how a given piece of data was computed—which input partitions it used, and which operators were applied to it. Spark uses the resilient distributed dataset (RDD) abstraction for tracking the ancestry of data [61], while Flink checkpoints operator state, allowing it to resume running an operator that ran into a fault during its execution [66].
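下面是一个体现 RDD 风格“血统”思想的玩具级 Python 草图(假设性的简化,与 Spark 的真实 API 无关):每个数据集都记录自己是如何派生出来的,因此丢失的分区可以从其父数据集重新推导,而不必依赖持久化副本。

Here is a toy Python sketch of the RDD-style lineage idea (a hypothetical simplification, unrelated to Spark’s real API): each dataset records how it was derived, so a lost partition can be re-derived from its parent instead of from a durable copy.

```python
# 玩具草图 / toy sketch — not Spark's real API.

class Dataset:
    def __init__(self, parent=None, fn=None, data=None):
        self.parent, self.fn = parent, fn     # recorded lineage
        self._cache = data                    # in-memory partition (may be lost)

    def map(self, fn):
        return Dataset(parent=self, fn=fn)

    def compute(self):
        if self._cache is None:               # lost (or never computed):
            self._cache = [self.fn(x) for x in self.parent.compute()]
        return self._cache                    # ...re-derive via lineage

source = Dataset(data=[1, 2, 3])
derived = source.map(lambda x: x * 10).map(lambda x: x + 1)
assert derived.compute() == [11, 21, 31]

derived._cache = None                         # simulate losing the partition
assert derived.compute() == [11, 21, 31]      # rebuilt from its parent
```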

重新计算数据时,重要的是要知道这个计算是否是确定性的:也就是说,给定相同的输入数据,算子是否总是产生相同的输出?如果部分丢失的数据已经发送给了下游算子,这个问题就很重要。如果算子重启后重新计算出的数据与原先丢失的数据不一致,下游算子就很难协调新旧数据之间的矛盾。对于非确定性算子,解决方案通常是把下游算子也一并终止,再基于新数据重新运行它们。

When recomputing data, it is important to know whether the computation is deterministic: that is, given the same input data, do the operators always produce the same output? This question matters if some of the lost data has already been sent to downstream operators. If the operator is restarted and the recomputed data is not the same as the original lost data, it becomes very hard for downstream operators to resolve the contradictions between the old and new data. The solution in the case of nondeterministic operators is normally to kill the downstream operators as well, and run them again on the new data.

为了避免这种级联故障,最好让算子保持确定性。但要注意,非确定性行为很容易在不经意间混入:例如,许多编程语言在迭代哈希表元素时不保证任何特定顺序,许多概率和统计算法显式依赖随机数,而任何对系统时钟或外部数据源的使用也都是非确定性的。要想可靠地从故障中恢复,就需要消除这类不确定性的来源,例如使用固定种子生成伪随机数。

In order to avoid such cascading faults, it is better to make operators deterministic. Note however that it is easy for nondeterministic behavior to accidentally creep in: for example, many programming languages do not guarantee any particular order when iterating over elements of a hash table, many probabilistic and statistical algorithms explicitly rely on using random numbers, and any use of the system clock or external data sources is nondeterministic. Such causes of nondeterminism need to be removed in order to reliably recover from faults, for example by generating pseudorandom numbers using a fixed seed.
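下面的 Python 草图演示了上文提到的“固定种子”技巧(纯属示意):为每个分区使用一个由分区 ID 派生的种子,使得故障后的重新计算能够精确复现原来的输出。

The following Python sketch demonstrates the fixed-seed trick mentioned above (illustrative only): seeding the pseudorandom generator per partition makes recomputation after a fault reproduce exactly the original output.

```python
import random

# 纯示意 / illustrative only: deterministic sampling per partition.

def sample_nondeterministic(records, k):
    return random.sample(records, k)          # differs from run to run

def sample_deterministic(records, k, partition_id):
    rng = random.Random(partition_id)         # fixed seed per partition
    return rng.sample(records, k)

records = list(range(100))
first = sample_deterministic(records, 5, partition_id=7)
recomputed = sample_deterministic(records, 5, partition_id=7)
assert first == recomputed                    # safe to recompute after a fault
```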

通过重新计算数据来从故障中恢复并不总是正确的答案:如果中间数据比源数据小得多,或者如果计算非常占用 CPU 资源,那么将中间数据具体化到文件可能比重新计算它更便宜。

Recovering from faults by recomputing data is not always the right answer: if the intermediate data is much smaller than the source data, or if the computation is very CPU-intensive, it is probably cheaper to materialize the intermediate data to files than to recompute it.

关于物化的讨论

Discussion of materialization

回到 Unix 类比,我们看到 MapReduce 就像将每个命令的输出写入临时文件,而数据流引擎看起来更像 Unix 管道。Flink 尤其是围绕管道执行的理念构建的:即将一个运算符的输出增量传递给其他运算符,而不是等待输入完成才开始处理它。

Returning to the Unix analogy, we saw that MapReduce is like writing the output of each command to a temporary file, whereas dataflow engines look much more like Unix pipes. Flink especially is built around the idea of pipelined execution: that is, incrementally passing the output of an operator to other operators, and not waiting for the input to be complete before starting to process it.

排序操作不可避免地需要消耗其整个输入才能产生任何输出,因为最后一个输入记录可能是具有最低键的记录,因此需要成为第一个输出记录。因此,任何需要排序的运算符都需要积累状态,至少是暂时的。但工作流的许多其他部分可以以管道方式执行。

A sorting operation inevitably needs to consume its entire input before it can produce any output, because it’s possible that the very last input record is the one with the lowest key and thus needs to be the very first output record. Any operator that requires sorting will thus need to accumulate state, at least temporarily. But many other parts of a workflow can be executed in a pipelined manner.

当作业完成时,它的输出需要保存到持久的地方,以便用户可以找到它并使用它 - 最有可能的是,它会再次写入分布式文件系统。因此,当使用数据流引擎时,HDFS 上的物化数据集通常仍然是作业的输入和最终输出。与 MapReduce 一样,输入是不可变的,并且输出被完全替换。相对于 MapReduce 的改进在于您无需将所有中间状态写入文件系统。

When the job completes, its output needs to go somewhere durable so that users can find it and use it—most likely, it is written to the distributed filesystem again. Thus, when using a dataflow engine, materialized datasets on HDFS are still usually the inputs and the final outputs of a job. Like with MapReduce, the inputs are immutable and the output is completely replaced. The improvement over MapReduce is that you save yourself writing all the intermediate state to the filesystem as well.

图和迭代处理

Graphs and Iterative Processing

在“类图数据模型”中,我们讨论了使用图来建模数据,以及使用图查询语言来遍历图中的边和顶点。第 2 章中的讨论主要集中在 OLTP 风格的用法上:快速执行查询,找到匹配特定条件的少量顶点。

In “Graph-Like Data Models” we discussed using graphs for modeling data, and using graph query languages to traverse the edges and vertices in a graph. The discussion in Chapter 2 was focused around OLTP-style use: quickly executing queries to find a small number of vertices matching certain criteria.

在批处理上下文中查看图形也很有趣,其目标是对整个图形执行某种离线处理或分析。这种需求经常出现在推荐引擎等机器学习应用程序或排名系统中。例如,最著名的图形分析算法之一是 PageRank [ 69 ],它试图根据其他网页链接到该网页的内容来估计该网页的受欢迎程度。它用作确定网络搜索引擎显示结果的顺序的公式的一部分。

It is also interesting to look at graphs in a batch processing context, where the goal is to perform some kind of offline processing or analysis on an entire graph. This need often arises in machine learning applications such as recommendation engines, or in ranking systems. For example, one of the most famous graph analysis algorithms is PageRank [69], which tries to estimate the popularity of a web page based on what other web pages link to it. It is used as part of the formula that determines the order in which web search engines present their results.

注意

Spark、Flink 和 Tez(请参阅“中间状态的具体化”)等数据流引擎通常将作业中的运算符排列为有向无环图 (DAG)。这与图形处理不同:在数据流引擎中,从一个运算符到另一个运算符的数据流被构造为图形,而数据本身通常由关系型元组组成。在图处理中,数据本身具有图的形式。另一个不幸的命名混乱!

Dataflow engines like Spark, Flink, and Tez (see “Materialization of Intermediate State”) typically arrange the operators in a job as a directed acyclic graph (DAG). This is not the same as graph processing: in dataflow engines, the flow of data from one operator to another is structured as a graph, while the data itself typically consists of relational-style tuples. In graph processing, the data itself has the form of a graph. Another unfortunate naming confusion!

许多图算法的表达方式是一次遍历一条边,将一个顶点与相邻顶点连接起来以传播某些信息,并不断重复,直到满足某个条件——例如,直到没有更多的边可以跟随,或者直到某个度量收敛。我们在图 2-6 中看到过一个例子:它通过反复跟随指示“哪个地点位于哪个其他地点之内”的边,列出了数据库中包含的北美所有地点(这种算法称为传递闭包)。

Many graph algorithms are expressed by traversing one edge at a time, joining one vertex with an adjacent vertex in order to propagate some information, and repeating until some condition is met—for example, until there are no more edges to follow, or until some metric converges. We saw an example in Figure 2-6, which made a list of all the locations in North America contained in a database by repeatedly following edges indicating which location is within which other location (this kind of algorithm is called a transitive closure).

可以将图存储在分布式文件系统中(文件中包含顶点和边的列表),但这种“重复直到完成”的想法无法用普通的 MapReduce 来表达,因为它只对数据执行一次遍历。因此,这类算法通常以迭代的方式实现:

It is possible to store a graph in a distributed filesystem (in files containing lists of vertices and edges), but this idea of “repeating until done” cannot be expressed in plain MapReduce, since it only performs a single pass over the data. This kind of algorithm is thus often implemented in an iterative style:

  1. 外部调度程序运行批处理来计算算法的一个步骤。

  1. An external scheduler runs a batch process to calculate one step of the algorithm.

  2. 当批处理完成时,调度程序检查它是否已经完成(基于完成条件——例如,没有更多的边可以跟随,或者与上次迭代相比的变化低于某个阈值)。

  2. When the batch process completes, the scheduler checks whether it has finished (based on the completion condition—e.g., there are no more edges to follow, or the change compared to the last iteration is below some threshold).

  3. 如果尚未完成,调度程序将返回步骤 1,再运行一轮批处理。

  3. If it has not yet finished, the scheduler goes back to step 1 and runs another round of the batch process.
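上述“外部驱动循环”可以用如下 Python 草图表示,示例数据是虚构的地点包含关系,完成条件是不动点(不再产生新的事实),即一个传递闭包计算:

The external driver loop above can be sketched in Python as follows; the containment edges are invented example data, and the completion condition is a fixpoint (no new facts), i.e., a transitive closure computation:

```python
# 草图 / sketch: driver loop repeating a "batch step" until a fixpoint.

within = {               # child -> parent containment edges (invented data)
    "Seattle": "Washington",
    "Washington": "North America",
    "Vancouver": "British Columbia",
    "British Columbia": "North America",
}

def one_step(reachable):
    # One "batch job": extend every known containment path by one edge.
    new = set(reachable)
    for child, parent in within.items():
        for place, ancestor in reachable:
            if ancestor == child:
                new.add((place, parent))
    return new

reachable = {(c, p) for c, p in within.items()}
while True:                          # the external scheduler's loop
    next_state = one_step(reachable)
    if next_state == reachable:      # completion condition: no new facts
        break
    reachable = next_state

in_na = sorted(p for p, a in reachable if a == "North America")
assert in_na == ["British Columbia", "Seattle", "Vancouver", "Washington"]
```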

这种方法可行,但用 MapReduce 实现通常效率非常低,因为 MapReduce 没有考虑算法的迭代性质:即使与上一次迭代相比图中只有一小部分发生了变化,它也总是会读取整个输入数据集并生成一个全新的输出数据集。

This approach works, but implementing it with MapReduce is often very inefficient, because MapReduce does not account for the iterative nature of the algorithm: it will always read the entire input dataset and produce a completely new output dataset, even if only a small part of the graph has changed compared to the last iteration.

Pregel 处理模型

The Pregel processing model

作为对图的批处理的一种优化,批量同步并行(BSP)计算模型 [ 70 ] 已经流行起来。实现它的系统包括 Apache Giraph [ 37 ]、Spark 的 GraphX API 和 Flink 的 Gelly API [ 71 ] 等。它也被称为 Pregel 模型,因为 Google 的 Pregel 论文推广了这种处理图的方法 [ 72 ]。

As an optimization for batch processing graphs, the bulk synchronous parallel (BSP) model of computation [70] has become popular. Among others, it is implemented by Apache Giraph [37], Spark’s GraphX API, and Flink’s Gelly API [71]. It is also known as the Pregel model, as Google’s Pregel paper popularized this approach for processing graphs [72].

回想一下,在 MapReduce 中,映射器在概念上是向归约器(reducer)的某次特定调用“发送消息”,因为框架会把具有相同键的所有映射器输出收集到一起。Pregel 背后的思想与之类似:一个顶点可以向另一个顶点“发送消息”,这些消息通常沿着图中的边发送。

Recall that in MapReduce, mappers conceptually “send a message” to a particular call of the reducer because the framework collects together all the mapper outputs with the same key. A similar idea is behind Pregel: one vertex can “send a message” to another vertex, and typically those messages are sent along the edges in a graph.

在每次迭代中,都会为每个顶点调用一个函数,并把发送给它的所有消息传递给它——非常类似于对归约器的调用。与 MapReduce 的区别在于,在 Pregel 模型中,顶点会在内存中记住从一次迭代到下一次迭代的状态,因此该函数只需要处理新传入的消息。如果图中的某个部分没有消息发送,就不需要做任何工作。

In each iteration, a function is called for each vertex, passing it all the messages that were sent to it—much like a call to the reducer. The difference from MapReduce is that in the Pregel model, a vertex remembers its state in memory from one iteration to the next, so the function only needs to process new incoming messages. If no messages are being sent in some part of the graph, no work needs to be done.
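下面是 Pregel 模型的一个单进程玩具草图(假设性的简化):每个顶点在多轮迭代之间保留状态,只在收到消息时才做工作;这里每个顶点传播它所见过的最小 ID,从而算出连通分量。

Below is a single-process toy sketch of the Pregel model (a hypothetical simplification): each vertex keeps state across rounds and only does work when it receives messages; here each vertex propagates the smallest ID it has seen, computing connected components.

```python
# 玩具草图 / toy sketch of vertex-centric ("think like a vertex") execution.

edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}   # invented graph
state = {v: v for v in edges}      # per-vertex state survives between rounds

active = set(edges)                # vertices with messages to send
while active:
    outbox = {v: [] for v in edges}
    for v in active:               # each active vertex messages its neighbors
        for nbr in edges[v]:
            outbox[nbr].append(state[v])
    active = set()
    for v, inbox in outbox.items():
        if inbox and min(inbox) < state[v]:   # only process incoming messages
            state[v] = min(inbox)
            active.add(v)          # changed state -> send messages next round

# Vertices 1-3 and 4-5 form two separate connected components.
assert state == {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

当某一轮没有任何顶点收到能改变其状态的消息时,算法自然终止。/ The algorithm terminates naturally once no vertex receives a message that changes its state.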

如果把每个顶点看作一个 actor,那么这有点类似于 actor 模型(请参阅“分布式 actor 框架”),区别在于顶点状态和顶点之间的消息是容错且持久的,并且通信以固定的轮次进行:在每次迭代中,框架都会投递上一轮发送的所有消息。Actor 通常没有这样的时序保证。

It’s a bit similar to the actor model (see “Distributed actor frameworks”), if you think of each vertex as an actor, except that vertex state and messages between vertices are fault-tolerant and durable, and communication proceeds in fixed rounds: at every iteration, the framework delivers all messages sent in the previous iteration. Actors normally have no such timing guarantee.

容错能力

Fault tolerance

顶点只能通过消息传递进行通信(而不能直接相互查询),这一事实有助于提高 Pregel 作业的性能,因为消息可以批量处理,且等待通信的时间更少。唯一的等待发生在迭代之间:由于 Pregel 模型保证在一次迭代中发送的所有消息都会在下一次迭代中投递,所以前一次迭代必须完全结束、其所有消息都通过网络复制完毕之后,下一次迭代才能开始。

The fact that vertices can only communicate by message passing (not by querying each other directly) helps improve the performance of Pregel jobs, since messages can be batched and there is less waiting for communication. The only waiting is between iterations: since the Pregel model guarantees that all messages sent in one iteration are delivered in the next iteration, the prior iteration must completely finish, and all of its messages must be copied over the network, before the next one can start.

即使底层网络可能会丢弃、重复或任意延迟消息(请参阅“不可靠的网络”),Pregel 实现也能保证消息在接下来的迭代中,在其目标顶点上恰好被处理一次。与 MapReduce 一样,该框架对故障的恢复是透明的,以简化在 Pregel 之上实现算法的编程模型。

Even though the underlying network may drop, duplicate, or arbitrarily delay messages (see “Unreliable Networks”), Pregel implementations guarantee that messages are processed exactly once at their destination vertex in the following iteration. Like MapReduce, the framework transparently recovers from faults in order to simplify the programming model for algorithms on top of Pregel.

这种容错能力是通过在迭代结束时定期检查所有顶点的状态来实现的,即将它们的完整状态写入持久存储。如果节点发生故障并且其内存状态丢失,最简单的解决方案是将整个图计算回滚到最后一个检查点并重新启动计算。如果算法是确定性的并且记录了消息,则还可以有选择地仅恢复丢失的分区(就像我们之前讨论的数据流引擎一样)[ 72 ]。

This fault tolerance is achieved by periodically checkpointing the state of all vertices at the end of an iteration—i.e., writing their full state to durable storage. If a node fails and its in-memory state is lost, the simplest solution is to roll back the entire graph computation to the last checkpoint and restart the computation. If the algorithm is deterministic and messages are logged, it is also possible to selectively recover only the partition that was lost (like we previously discussed for dataflow engines) [72].

并行执行

Parallel execution

顶点不需要知道它正在哪台物理机器上执行;当它向其他顶点发送消息时,它只是将它们发送到一个顶点 ID。由框架来对图进行分区,即决定哪个顶点在哪台机器上运行,以及如何通过网络路由消息以便它们最终到达正确的位置。

A vertex does not need to know on which physical machine it is executing; when it sends messages to other vertices, it simply sends them to a vertex ID. It is up to the framework to partition the graph—i.e., to decide which vertex runs on which machine, and how to route messages over the network so that they end up in the right place.

由于编程模型一次只处理一个顶点(有时称为“像顶点一样思考”),框架可以以任意方式对图进行分区。理想情况下,图的分区方式应使需要频繁通信的顶点位于同一台机器上。然而,找到这样一种优化的分区方式很困难——在实践中,图通常只是按任意分配的顶点 ID 来分区,而不会尝试把相关的顶点分组到一起。

Because the programming model deals with just one vertex at a time (sometimes called “thinking like a vertex”), the framework may partition the graph in arbitrary ways. Ideally it would be partitioned such that vertices are colocated on the same machine if they need to communicate a lot. However, finding such an optimized partitioning is hard—in practice, the graph is often simply partitioned by an arbitrarily assigned vertex ID, making no attempt to group related vertices together.

因此,图算法往往具有大量的跨机通信开销,并且中间状态(节点之间发送的消息)往往比原始图更大。通过网络发送消息的开销会显着减慢分布式图算法的速度。

As a result, graph algorithms often have a lot of cross-machine communication overhead, and the intermediate state (messages sent between nodes) is often bigger than the original graph. The overhead of sending messages over the network can significantly slow down distributed graph algorithms.

因此,如果您的图可以装进单台计算机的内存,那么单机(甚至可能是单线程)算法很可能会胜过分布式批处理 [ 73 , 74 ]。即使图比内存大,只要它能装进单台计算机的磁盘,使用 GraphChi 之类的框架进行单机处理仍是一个可行的选择 [ 75 ]。如果图大到单台机器都放不下,那么 Pregel 之类的分布式方法就不可避免了;高效地并行化图算法是一个仍在进行中的研究领域 [ 76 ]。

For this reason, if your graph can fit in memory on a single computer, it’s quite likely that a single-machine (maybe even single-threaded) algorithm will outperform a distributed batch process [73, 74]. Even if the graph is bigger than memory, as long as it can fit on the disks of a single computer, single-machine processing using a framework such as GraphChi is a viable option [75]. If the graph is too big to fit on a single machine, a distributed approach such as Pregel is unavoidable; efficiently parallelizing graph algorithms is an area of ongoing research [76].

高级 API 和语言

High-Level APIs and Languages

自 MapReduce 首次流行以来,分布式批处理的执行引擎已经成熟。如今,这些基础设施已经足够健壮,能够在超过 10,000 台机器的集群上存储和处理数 PB 的数据。由于在这种规模上实际运行批处理作业的问题已被认为基本解决,人们的注意力转向了其他领域:改进编程模型、提高处理效率,以及扩大这些技术所能解决的问题范围。

Over the years since MapReduce first became popular, the execution engines for distributed batch processing have matured. By now, the infrastructure has become robust enough to store and process many petabytes of data on clusters of over 10,000 machines. As the problem of physically operating batch processes at such scale has been considered more or less solved, attention has turned to other areas: improving the programming model, improving the efficiency of processing, and broadening the set of problems that these technologies can solve.

如前所述,Hive、Pig、Cascading 和 Crunch 等高级语言和 API 变得流行,因为手动编写 MapReduce 作业非常费力。随着 Tez 的出现,这些高级语言具有额外的好处,即能够迁移到新的数据流执行引擎,而无需重写作业代码。Spark 和 Flink 还包含自己的高级数据流 API,通常从 FlumeJava 中汲取灵感 [ 34 ]。

As discussed previously, higher-level languages and APIs such as Hive, Pig, Cascading, and Crunch became popular because programming MapReduce jobs by hand is quite laborious. As Tez emerged, these high-level languages had the additional benefit of being able to move to the new dataflow execution engine without the need to rewrite job code. Spark and Flink also include their own high-level dataflow APIs, often taking inspiration from FlumeJava [34].

这些数据流 API 通常使用关系型构建块来表达计算:根据某个字段的值连接数据集;按键对元组进行分组;按某种条件过滤;通过计数、求和或其他函数来聚合元组。在内部,这些操作是使用我们在本章前面讨论的各种连接和分组算法来实现的。

These dataflow APIs generally use relational-style building blocks to express a computation: joining datasets on the value of some field; grouping tuples by key; filtering by some condition; and aggregating tuples by counting, summing, or other functions. Internally, these operations are implemented using the various join and grouping algorithms that we discussed earlier in this chapter.
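这些构建块可以用内存中元组上的朴素 Python 实现来示意(仅为说明语义;真实引擎把同样的操作作为分区化的分布式作业来执行,函数名为虚构):

These building blocks can be illustrated with naive Python over in-memory tuples (semantics only; real engines run the same operations as partitioned, distributed jobs, and the function names are invented):

```python
from collections import defaultdict

# 朴素示意 / naive illustration of relational-style building blocks.

def join(left, right, key):
    index = defaultdict(list)                 # build a hash index on `key`
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index[l[key]]]

def group_count(rows, key):
    counts = defaultdict(int)                 # grouping + aggregation
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)

clicks = [{"user": "a", "url": "/x"},
          {"user": "b", "url": "/x"},
          {"user": "a", "url": "/y"}]
users = [{"user": "a", "country": "NZ"}, {"user": "b", "country": "DE"}]

joined = join(clicks, users, "user")          # enrich clicks with user info
assert group_count(joined, "country") == {"NZ": 2, "DE": 1}
```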

除了需要更少代码的明显优势之外,这些高级接口还允许交互式使用,您可以在 shell 中增量编写分析代码并经常运行它以观察它正在做什么。在探索数据集并尝试处理数据集的方法时,这种开发方式非常有帮助。这也让人想起我们在《Unix 哲学》中讨论过的 Unix 哲学。

Besides the obvious advantage of requiring less code, these high-level interfaces also allow interactive use, in which you write analysis code incrementally in a shell and run it frequently to observe what it is doing. This style of development is very helpful when exploring a dataset and experimenting with approaches for processing it. It is also reminiscent of the Unix philosophy, which we discussed in “The Unix Philosophy”.

此外,这些高级接口不仅使使用系统的人员更加高效,而且还提高了机器级别的作业执行效率。

Moreover, these high-level interfaces not only make the humans using the system more productive, but they also improve the job execution efficiency at a machine level.

向声明式查询语言的转变

The move toward declarative query languages

与手工写出执行连接的代码相比,将连接指定为关系算子的优点是:框架可以分析连接输入的属性,并自动决定上述哪种连接算法最适合当前任务。Hive、Spark 和 Flink 都具有基于成本的查询优化器,可以做到这一点,甚至可以改变连接的顺序,使中间状态的数量最小化 [ 66 , 77 , 78 , 79 ]。

An advantage of specifying joins as relational operators, compared to spelling out the code that performs the join, is that the framework can analyze the properties of the join inputs and automatically decide which of the aforementioned join algorithms would be most suitable for the task at hand. Hive, Spark, and Flink have cost-based query optimizers that can do this, and even change the order of joins so that the amount of intermediate state is minimized [66, 77, 78, 79].

连接算法的选择可以对批处理作业的性能产生很大的影响,并且不必理解和记住我们在本章中讨论的所有各种连接算法,这很好。如果以声明性方式指定连接,则这是可能的:应用程序只需说明需要哪些连接,而查询优化器则决定如何最好地执行它们。我们之前在“数据查询语言”中遇到过这个想法。

The choice of join algorithm can make a big difference to the performance of a batch job, and it is nice not to have to understand and remember all the various join algorithms we discussed in this chapter. This is possible if joins are specified in a declarative way: the application simply states which joins are required, and the query optimizer decides how they can best be executed. We previously came across this idea in “Query Languages for Data”.
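优化器替我们做的决策本质上类似下面这个草图(阈值和名称均为虚构):当一侧输入足够小、可以复制到所有分区时,选择广播哈希连接;否则退回到排序合并连接。

The decision an optimizer makes on our behalf is essentially like this sketch (the threshold and names are invented): pick a broadcast hash join when one input is small enough to replicate to every partition, and fall back to a sort-merge join otherwise.

```python
# 虚构草图 / invented sketch of a cost-based join-strategy choice.

BROADCAST_LIMIT = 10 * 1024 * 1024  # assumed 10 MB threshold (made up)

def choose_join_strategy(left_bytes, right_bytes):
    if min(left_bytes, right_bytes) <= BROADCAST_LIMIT:
        return "broadcast-hash-join"     # ship the small side to all partitions
    return "sort-merge-join"             # repartition and sort both sides

assert choose_join_strategy(5_000_000_000, 2_000_000) == "broadcast-hash-join"
assert choose_join_strategy(5_000_000_000, 8_000_000_000) == "sort-merge-join"
```

真实优化器还会考虑统计信息(行数、键的分布)和连接顺序,而不仅仅是字节大小。/ A real optimizer also weighs statistics (row counts, key distributions) and join ordering, not just byte sizes.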

然而,在其他方面,MapReduce 及其数据流后继者与 SQL 的完全声明式查询模型有很大不同。MapReduce 是围绕函数回调的思想构建的:对于每个记录或记录组,调用用户定义的函数(映射器或减速器),并且该函数可以自由调用任意代码来决定输出什么。这种方法的优点是,您可以利用现有库的大型生态系统来执行解析、自然语言分析、图像分析以及运行数值或统计算法等操作。

However, in other ways, MapReduce and its dataflow successors are very different from the fully declarative query model of SQL. MapReduce was built around the idea of function callbacks: for each record or group of records, a user-defined function (the mapper or reducer) is called, and that function is free to call arbitrary code in order to decide what to output. This approach has the advantage that you can draw upon a large ecosystem of existing libraries to do things like parsing, natural language analysis, image analysis, and running numerical or statistical algorithms.

轻松运行任意代码的自由,一直是 MapReduce 一脉的批处理系统与 MPP 数据库的长期区别(请参阅“Hadoop 与分布式数据库的对比”);尽管数据库也有编写用户定义函数的功能,但它们通常用起来很麻烦,而且无法与大多数编程语言中广泛使用的包管理器和依赖管理系统(例如 Java 的 Maven、JavaScript 的 npm、Ruby 的 Rubygems)良好集成。

The freedom to easily run arbitrary code is what has long distinguished batch processing systems of MapReduce heritage from MPP databases (see “Comparing Hadoop to Distributed Databases”); although databases have facilities for writing user-defined functions, they are often cumbersome to use and not well integrated with the package managers and dependency management systems that are widely used in most programming languages (such as Maven for Java, npm for JavaScript, and Rubygems for Ruby).

然而,数据流引擎发现,除了连接之外,在其他区域合并更多声明性功能也有优势。例如,如果回调函数仅包含简单的过滤条件,或者仅从记录中选择一些字段,则在每条记录上调用该函数会产生大量的 CPU 开销。如果这种简单的过滤和映射操作以声明性方式表达,则查询优化器可以利用面向列的存储布局(请参阅“面向列的存储”)并从磁盘中仅读取所需的列。Hive、Spark DataFrames 和 Impala 也使用矢量化执行(请参阅 “内存带宽和矢量化处理”):在对 CPU 缓存友好的紧密内循环中迭代数据,并避免函数调用。Spark 生成 JVM 字节码 [ 79 ],Impala 使用 LLVM 为这些内部循环生成本机代码 [ 41 ]。

However, dataflow engines have found that there are also advantages to incorporating more declarative features in areas besides joins. For example, if a callback function contains only a simple filtering condition, or it just selects some fields from a record, then there is significant CPU overhead in calling the function on every record. If such simple filtering and mapping operations are expressed in a declarative way, the query optimizer can take advantage of column-oriented storage layouts (see “Column-Oriented Storage”) and read only the required columns from disk. Hive, Spark DataFrames, and Impala also use vectorized execution (see “Memory bandwidth and vectorized processing”): iterating over data in a tight inner loop that is friendly to CPU caches, and avoiding function calls. Spark generates JVM bytecode [79] and Impala uses LLVM to generate native code for these inner loops [41].
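下面的草图说明了为什么声明式的过滤条件有帮助(纯属示意):按列存储时,引擎可以在紧凑的循环中只扫描需要的列,而不是对每条完整记录调用一次用户回调。

The sketch below shows why declarative filters help (illustrative only): with column-oriented storage the engine can scan just the needed column in a tight loop, instead of invoking a user callback on every full record.

```python
# 纯示意 / illustrative: row-at-a-time callback vs. columnar scan.

rows = [("alice", 34, "NZ"), ("bob", 52, "DE"), ("carol", 41, "NZ")]

# Row-oriented: a callback per record, touching every field.
def filter_rows(rows, predicate):
    return [r for r in rows if predicate(r)]

# Column-oriented: one array per column; the filter reads only the
# requested column and returns the indices of matching rows.
columns = {"name":    [r[0] for r in rows],
           "age":     [r[1] for r in rows],
           "country": [r[2] for r in rows]}

def filter_columnar(columns, column, threshold):
    return [i for i, v in enumerate(columns[column]) if v > threshold]

assert [r[0] for r in filter_rows(rows, lambda r: r[1] > 40)] == ["bob", "carol"]
assert filter_columnar(columns, "age", 40) == [1, 2]
```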

通过将声明性方面合并到其高级 API 中,并拥有可以在执行期间利用它们的查询优化器,批处理框架开始看起来更像 MPP 数据库(并且可以实现相当的性能)。同时,由于具有能够运行任意代码和读取任意格式数据的可扩展性,它们保留了灵活性优势。

By incorporating declarative aspects in their high-level APIs, and having query optimizers that can take advantage of them during execution, batch processing frameworks begin to look more like MPP databases (and can achieve comparable performance). At the same time, by having the extensibility of being able to run arbitrary code and read data in arbitrary formats, they retain their flexibility advantage.

针对不同领域的专业化

Specialization for different domains

虽然能够运行任意代码的可扩展性很有用,但在许多常见情况下,标准处理模式不断重复出现,因此值得拥有通用构建块的可重用实现。传统上,MPP 数据库满足了商业智能分析师和商业报告的需求,但这只是使用批处理的众多领域之一。

While the extensibility of being able to run arbitrary code is useful, there are also many common cases where standard processing patterns keep reoccurring, and so it is worth having reusable implementations of the common building blocks. Traditionally, MPP databases have served the needs of business intelligence analysts and business reporting, but that is just one among many domains in which batch processing is used.

另一个日益重要的领域是统计和数值算法,这是机器学习应用(例如分类和推荐系统)所需要的。可重用的实现正在不断涌现:例如,Mahout 在 MapReduce、Spark 和 Flink 之上实现了各种机器学习算法,而 MADlib 在关系 MPP 数据库 (Apache HAWQ) 中实现了类似的功能 [54 ]

Another domain of increasing importance is statistical and numerical algorithms, which are needed for machine learning applications such as classification and recommendation systems. Reusable implementations are emerging: for example, Mahout implements various algorithms for machine learning on top of MapReduce, Spark, and Flink, while MADlib implements similar functionality inside a relational MPP database (Apache HAWQ) [54].

同样有用的是空间算法,例如k-近邻 [ 80 ],它在某个多维空间中搜索与给定项目接近的项目 - 一种相似性搜索。近似搜索对于基因组分析算法也很重要,它需要查找相似但不相同的字符串[ 81 ]。

Also useful are spatial algorithms such as k-nearest neighbors [80], which searches for items that are close to a given item in some multi-dimensional space—a kind of similarity search. Approximate search is also important for genome analysis algorithms, which need to find strings that are similar but not identical [81].

批处理引擎被用于分布式执行来自越来越广泛的领域的算法。随着批处理系统获得内置功能和高级声明性运算符,并且随着 MPP 数据库变得更加可编程和灵活,两者开始看起来越来越相似:最终,它们都只是用于存储和处理数据的系统。

Batch processing engines are being used for distributed execution of algorithms from an increasingly wide range of domains. As batch processing systems gain built-in functionality and high-level declarative operators, and as MPP databases become more programmable and flexible, the two are beginning to look more alike: in the end, they are all just systems for storing and processing data.

小结

Summary

在本章中,我们探讨了批处理这个主题。我们首先考察了 awk、grep 和 sort 等 Unix 工具,然后看到这些工具的设计理念如何延续到 MapReduce 以及更新的数据流引擎中。其中的一些设计原则是:输入是不可变的,输出旨在成为另一个(目前还未知的)程序的输入,而复杂的问题通过组合“把一件事做好”的小工具来解决。

In this chapter we explored the topic of batch processing. We started by looking at Unix tools such as awk, grep, and sort, and we saw how the design philosophy of those tools is carried forward into MapReduce and more recent dataflow engines. Some of those design principles are that inputs are immutable, outputs are intended to become the input to another (as yet unknown) program, and complex problems are solved by composing small tools that “do one thing well.”

在Unix世界中,允许一个程序与另一个程序组合的统一接口是文件和管道;在 MapReduce 中,该接口是分布式文件系统。我们看到数据流引擎添加了自己的类似管道的数据传输机制,以避免将中间状态具体化到分布式文件系统,但作业的初始输入和最终输出通常仍然是 HDFS。

In the Unix world, the uniform interface that allows one program to be composed with another is files and pipes; in MapReduce, that interface is a distributed filesystem. We saw that dataflow engines add their own pipe-like data transport mechanisms to avoid materializing intermediate state to the distributed filesystem, but the initial input and final output of a job is still usually HDFS.

分布式批处理框架需要解决的两个主要问题是:

The two main problems that distributed batch processing frameworks need to solve are:

分区
Partitioning

在 MapReduce 中,Mapper 根据输入文件块进行分区。Mapper 的输出经过重新分区、排序,并合并到数量可配置的 Reducer 分区中。这个过程的目的是把所有相关的数据(例如,具有相同键的所有记录)汇集到同一个地方。

Post-MapReduce 数据流引擎尝试避免排序,除非需要,但它们在其他方面采用大致相似的分区方法。

In MapReduce, mappers are partitioned according to input file blocks. The output of mappers is repartitioned, sorted, and merged into a configurable number of reducer partitions. The purpose of this process is to bring all the related data—e.g., all the records with the same key—together in the same place.

Post-MapReduce dataflow engines try to avoid sorting unless it is required, but they otherwise take a broadly similar approach to partitioning.
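As an illustration, the repartition/sort/merge step can be sketched in a few lines of Python. In-memory lists and the built-in sort stand in for MapReduce's disk-based shuffle, so this is a sketch of the dataflow rather than an implementation:

```python
def shuffle(mapper_outputs, num_reducers):
    """Assign each mapper's (key, value) pairs to a reducer partition by
    hashing the key, then sort each partition so that all records with
    the same key end up adjacent, ready for a single reducer call."""
    partitions = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:            # one list per mapper
        for key, value in output:
            partitions[hash(key) % num_reducers].append((key, value))
    return [sorted(p) for p in partitions]

# Two mappers' outputs merged into one reducer partition:
shuffle([[("b", 1), ("a", 2)], [("a", 3)]], num_reducers=1)
# → [[("a", 2), ("a", 3), ("b", 1)]]
```

With `num_reducers=1` every record lands in the same partition; with more reducers, records with equal keys are still guaranteed to land together, which is the whole point of the shuffle.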

容错能力
Fault tolerance

MapReduce 频繁写入磁盘,这使得可以轻松地从单个失败任务中恢复,而无需重新启动整个作业,但会减慢无故障情况下的执行速度。数据流引擎执行较少的中间状态具体化,并将更多的中间状态保留在内存中,这意味着如果节点发生故障,它们需要重新计算更多的数据。确定性运算符减少了需要重新计算的数据量。

MapReduce frequently writes to disk, which makes it easy to recover from an individual failed task without restarting the entire job but slows down execution in the failure-free case. Dataflow engines perform less materialization of intermediate state and keep more in memory, which means that they need to recompute more data if a node fails. Deterministic operators reduce the amount of data that needs to be recomputed.

我们讨论了 MapReduce 的几种连接算法,其中大多数也在 MPP 数据库和数据流引擎内部使用。它们还很好地说明了分区算法的工作原理:

We discussed several join algorithms for MapReduce, most of which are also internally used in MPP databases and dataflow engines. They also provide a good illustration of how partitioned algorithms work:

排序合并连接
Sort-merge joins

参与连接的每个输入都会经过一个提取连接键的 Mapper。通过分区、排序和合并,具有相同键的所有记录最终会进入同一次 Reducer 调用。该 Reducer 随后可以输出连接后的记录。

Each of the inputs being joined goes through a mapper that extracts the join key. By partitioning, sorting, and merging, all the records with the same key end up going to the same call of the reducer. This function can then output the joined records.
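A minimal sketch of this dataflow, with in-memory lists standing in for mapper outputs and sorted partition files:

```python
from itertools import groupby
from operator import itemgetter

def sort_merge_join(left, right, key):
    """Tag every record with its join key and side of origin, sort by
    key, then group: all records sharing a key reach the same "reducer"
    group, which emits the joined pairs."""
    tagged = [(key(r), "L", r) for r in left] + [(key(r), "R", r) for r in right]
    tagged.sort(key=itemgetter(0))           # the shuffle's sort phase
    for _, group in groupby(tagged, key=itemgetter(0)):
        lefts, rights = [], []
        for _, side, rec in group:
            (lefts if side == "L" else rights).append(rec)
        for l in lefts:
            for r in rights:
                yield (l, r)

# Hypothetical user records joined with activity events on user ID:
joined = list(sort_merge_join([(1, "alice")],
                              [(1, "/home"), (2, "/about")],
                              key=itemgetter(0)))
# → [((1, "alice"), (1, "/home"))]
```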

广播哈希连接
Broadcast hash joins

两个连接输入之一很小,因此它不需要分区,而可以被完整加载到一个哈希表中。于是,你可以为大输入的每个分区启动一个 Mapper,把小输入的哈希表加载到每个 Mapper 中,然后逐条扫描大输入,在哈希表中查找每条记录。

One of the two join inputs is small, so it is not partitioned and it can be entirely loaded into a hash table. Thus, you can start a mapper for each partition of the large join input, load the hash table for the small input into each mapper, and then scan over the large input one record at a time, querying the hash table for each record.
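A sketch of the idea, with an ordinary Python dict standing in for the hash table that would be broadcast to every mapper:

```python
def broadcast_hash_join(small, large, key):
    """Build an in-memory hash table from the small input once, then
    stream over the large input one record at a time, probing the table.
    In a real job the table is loaded into every mapper for the large input."""
    table = {}
    for rec in small:
        table.setdefault(key(rec), []).append(rec)   # keys may repeat
    for rec in large:
        for match in table.get(key(rec), []):
            yield (match, rec)

# Hypothetical small user table joined against a large event log:
users = [(1, "alice")]
events = [(1, "/home"), (1, "/about"), (3, "/login")]
pairs = list(broadcast_hash_join(users, events, key=lambda r: r[0]))
# → [((1, "alice"), (1, "/home")), ((1, "alice"), (1, "/about"))]
```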

分区哈希连接
Partitioned hash joins

如果两个连接输入以相同的方式进行分区(使用相同的键、相同的哈希函数和相同的分区数量),则可以对每个分区独立使用哈希表方法。

If the two join inputs are partitioned in the same way (using the same key, same hash function, and same number of partitions), then the hash table approach can be used independently for each partition.
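A sketch under the stated precondition, assuming each partition pair fits in memory; `num_partitions` and the use of Python's built-in `hash` are illustrative choices:

```python
def partitioned_hash_join(a, b, key, num_partitions=4):
    """Both inputs are partitioned with the same hash function and the
    same number of partitions, so records with equal keys always land in
    the same partition; each partition pair is then hash-joined on its own."""
    def partition(records):
        parts = [[] for _ in range(num_partitions)]
        for rec in records:
            parts[hash(key(rec)) % num_partitions].append(rec)
        return parts

    for pa, pb in zip(partition(a), partition(b)):
        table = {}                            # per-partition hash table
        for rec in pa:
            table.setdefault(key(rec), []).append(rec)
        for rec in pb:
            for match in table.get(key(rec), []):
                yield (match, rec)

out = list(partitioned_hash_join([(1, "alice")],
                                 [(1, "/home"), (2, "/about")],
                                 key=lambda r: r[0]))
# → [((1, "alice"), (1, "/home"))]
```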

分布式批处理引擎有一个刻意受限的编程模型:回调函数(例如 Mapper 和 Reducer)被假定为无状态的,并且除了指定的输出之外没有外部可见的副作用。这种限制使框架能够把分布式系统中的一些难题隐藏在它的抽象背后:面对崩溃和网络问题,任务可以被安全地重试,而任何失败任务的输出都会被丢弃。如果同一分区的多个任务都成功了,则只有其中一个任务的输出会真正变得可见。

Distributed batch processing engines have a deliberately restricted programming model: callback functions (such as mappers and reducers) are assumed to be stateless and to have no externally visible side effects besides their designated output. This restriction allows the framework to hide some of the hard distributed systems problems behind its abstraction: in the face of crashes and network issues, tasks can be retried safely, and the output from any failed tasks is discarded. If several tasks for a partition succeed, only one of them actually makes its output visible.

借助这个框架,你在批处理作业中的代码无需操心容错机制的实现:框架可以保证作业的最终输出与没有发生任何故障时相同,即使实际上各种任务可能不得不被重试。这种可靠的语义,比在线服务中通常能得到的语义要强得多——在线服务处理用户请求,并把写入数据库作为处理请求的副作用。

Thanks to the framework, your code in a batch processing job does not need to worry about implementing fault-tolerance mechanisms: the framework can guarantee that the final output of a job is the same as if no faults had occurred, even though in reality various tasks perhaps had to be retried. These reliable semantics are much stronger than what you usually have in online services that handle user requests and that write to databases as a side effect of processing a request.
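The retry behavior can be illustrated with a toy harness; `flaky_word_count` is a hypothetical deterministic, side-effect-free task whose first attempt simulates a node crash:

```python
def run_with_retries(task, attempts=3):
    """Because a task is assumed deterministic and free of external side
    effects, a failed attempt's output can simply be discarded and the
    task rerun; the final result is as if no fault had occurred."""
    for _ in range(attempts):
        try:
            return task()
        except Exception:
            continue                      # discard partial output, retry
    raise RuntimeError("task failed after all retries")

calls = {"n": 0}
def flaky_word_count():
    calls["n"] += 1
    if calls["n"] < 2:
        raise IOError("simulated node crash")
    return {"hello": 2, "world": 1}       # deterministic designated output

result = run_with_retries(flaky_word_count)
# → {"hello": 2, "world": 1}, despite the first attempt crashing
```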

批处理作业的显著特征是:它读取一些输入数据并产生一些输出数据,而不会修改输入——换句话说,输出是从输入派生出来的。至关重要的是,输入数据是有界的:它具有已知的、固定的大小(例如,它由某个时间点的一组日志文件或数据库内容的快照组成)。因为输入是有界的,作业知道自己何时读完了整个输入,所以作业在处理完之后最终会结束。

The distinguishing feature of a batch processing job is that it reads some input data and produces some output data, without modifying the input—in other words, the output is derived from the input. Crucially, the input data is bounded: it has a known, fixed size (for example, it consists of a set of log files at some point in time, or a snapshot of a database’s contents). Because it is bounded, a job knows when it has finished reading the entire input, and so a job eventually completes when it is done.

在下一章中,我们将转向流处理,其中输入是无界的——也就是说,你仍然有一个作业,但它的输入是永无止境的数据流。在这种情况下,作业永远不会完成,因为随时可能还有更多的工作进来。我们将看到,流处理和批处理在某些方面是相似的,但无界流的假设也在很大程度上改变了我们构建系统的方式。

In the next chapter, we will turn to stream processing, in which the input is unbounded—that is, you still have a job, but its inputs are never-ending streams of data. In this case, a job is never complete, because at any time there may still be more work coming in. We shall see that stream and batch processing are similar in some respects, but the assumption of unbounded streams also changes a lot about how we build systems.

脚注

i 有些人喜欢指出这里的 cat 是不必要的,因为输入文件可以直接作为参数传给 awk。然而,这样写能让线性流水线更加一目了然。

i Some people love to point out that cat is unnecessary here, as the input file could be given directly as an argument to awk. However, the linear pipeline is more apparent when written like this.

ii 统一接口的另一个例子是 URL 和 HTTP,它们是 Web 的基础。URL 标识网站上的特定事物(资源),你可以从任何其他网站链接到任何 URL。因此,使用网络浏览器的用户可以通过链接在网站之间无缝跳转,即使这些服务器可能由完全不相关的组织运营。这个原则在今天看来是显而易见的,但它是使 Web 取得今天成功的关键洞察。以前的系统并不那么统一:例如,在公告板系统(BBS)时代,每个系统都有自己的电话号码和波特率配置。从一个 BBS 到另一个 BBS 的引用必须采用电话号码和调制解调器设置的形式;用户必须挂断电话,拨打另一个 BBS,然后手动查找他们要找的信息。当时无法直接链接到另一个 BBS 内的某段内容。

ii Another example of a uniform interface is URLs and HTTP, the foundations of the web. A URL identifies a particular thing (resource) on a website, and you can link to any URL from any other website. A user with a web browser can thus seamlessly jump between websites by following links, even though the servers may be operated by entirely unrelated organizations. This principle seems obvious today, but it was a key insight in making the web the success that it is today. Prior systems were not so uniform: for example, in the era of bulletin board systems (BBSs), each system had its own phone number and baud rate configuration. A reference from one BBS to another would have to be in the form of a phone number and modem settings; the user would have to hang up, dial the other BBS, and then manually find the information they were looking for. It wasn’t possible to link directly to some piece of content inside another BBS.

iii 除非使用单独的工具,例如 netcat 或 curl。Unix 最初试图把一切都表示为文件,但 BSD 套接字 API 偏离了这一约定 [ 17 ]。研究性操作系统 Plan 9 和 Inferno 在文件的使用上更加一致:它们把一条 TCP 连接表示为 /net/tcp 中的一个文件 [ 18 ]。

iii Except by using a separate tool, such as netcat or curl. Unix started out trying to represent everything as files, but the BSD sockets API deviated from that convention [17]. The research operating systems Plan 9 and Inferno are more consistent in their use of files: they represent a TCP connection as a file in /net/tcp [18].

iv 一个区别是:使用 HDFS 时,计算任务可以被调度到存储着特定文件副本的机器上运行,而对象存储通常把存储和计算分开。如果网络带宽是瓶颈,从本地磁盘读取具有性能优势。但请注意,如果使用了纠删码,这种局部性优势就会丧失,因为必须组合来自多台机器的数据才能重建出原始文件 [ 20 ]。

iv One difference is that with HDFS, computing tasks can be scheduled to run on the machine that stores a copy of a particular file, whereas object stores usually keep storage and computation separate. Reading from a local disk has a performance advantage if network bandwidth is a bottleneck. Note however that if erasure coding is used, the locality advantage is lost, because the data from several machines must be combined in order to reconstitute the original file [20].

v我们在本书中讨论的连接通常是等连接,这是最常见的连接类型,其中一条记录与在特定字段(例如 ID)中具有相同值的其他记录相关联。一些数据库支持更通用的连接类型,例如使用小于运算符而不是相等运算符,但我们没有空间在这里介绍它们。

v The joins we talk about in this book are generally equi-joins, the most common type of join, in which a record is associated with other records that have an identical value in a particular field (such as an ID). Some databases support more general types of joins, for example using a less-than operator instead of an equality operator, but we do not have space to cover them here.

vi此示例假设哈希表中的每个键都只有一个条目,这对于用户数据库来说可能是正确的(用户 ID 唯一标识用户)。一般来说,哈希表可能需要包含具有相同键的多个条目,并且连接运算符将输出某个键的所有匹配项。

vi This example assumes that there is exactly one entry for each key in the hash table, which is probably true with a user database (a user ID uniquely identifies a user). In general, the hash table may need to contain several entries with the same key, and the join operator will output all matches for a key.

参考

[ 1 ] Jeffrey Dean 和 Sanjay Ghemawat:“ MapReduce:大型集群上的简化数据处理”,第六届 USENIX 操作系统设计和实现(OSDI) 研讨会,2004 年 12 月。

[1] Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters,” at 6th USENIX Symposium on Operating System Design and Implementation (OSDI), December 2004.

[ 2 ] Joel Spolsky:“ JavaSchools 的危险”,joelonsoftware.com,2005 年 12 月 25 日。

[2] Joel Spolsky: “The Perils of JavaSchools,” joelonsoftware.com, December 25, 2005.

[ 3 ] Shivnath Babu 和 Herodotos Herodotou:“大规模并行数据库和 MapReduce 系统”,数据库基础与趋势,第 5 卷,第 1 期,第 1-104 页,2013 年 11 月 。doi:10.1561/1900000036

[3] Shivnath Babu and Herodotos Herodotou: “Massively Parallel Databases and MapReduce Systems,” Foundations and Trends in Databases, volume 5, number 1, pages 1–104, November 2013. doi:10.1561/1900000036

[ 4 ] David J. DeWitt 和 Michael Stonebraker:“ MapReduce:一大倒退”,最初发表于databasecolumn.vertica.com,2008 年 1 月 17 日。

[4] David J. DeWitt and Michael Stonebraker: “MapReduce: A Major Step Backwards,” originally published at databasecolumn.vertica.com, January 17, 2008.

[ 5 ] Henry Robinson:“大象是特洛伊木马:论 Google Map-Reduce 的消亡”, the-paper-trail.org,2014 年 6 月 25 日。

[5] Henry Robinson: “The Elephant Was a Trojan Horse: On the Death of Map-Reduce at Google,” the-paper-trail.org, June 25, 2014.

[ 6 ]“霍尔瑞斯机器”,美国人口普查局,census.gov

[6] “The Hollerith Machine,” United States Census Bureau, census.gov.

[ 7 ]“ IBM 82、83 和 84 分拣机参考手册”,A24-1034-1 版,国际商业机器公司,1962 年 7 月。

[7] “IBM 82, 83, and 84 Sorters Reference Manual,” Edition A24-1034-1, International Business Machines Corporation, July 1962.

[ 8 ] Adam Drake:“命令行工具可以比 Hadoop 集群快 235 倍”,aadrake.com,2014 年 1 月 25 日。

[8] Adam Drake: “Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster,” aadrake.com, January 25, 2014.

[ 9 ]“ GNU Coreutils 8.23 文档”,自由软件基金会,2014 年。

[9] “GNU Coreutils 8.23 Documentation,” Free Software Foundation, Inc., 2014.

[ 10 ] Martin Kleppmann:“ Kafka、Samza 和分布式数据的 Unix 哲学”,martin.kleppmann.com,2015 年 8 月 5 日。

[10] Martin Kleppmann: “Kafka, Samza, and the Unix Philosophy of Distributed Data,” martin.kleppmann.com, August 5, 2015.

[ 11 ] Doug McIlroy: 贝尔实验室内部备忘录,1964 年 10 月。引自:Dennis M. Richie:“ Doug McIlroy 的建议”, cm.bell-labs.com

[11] Doug McIlroy: Internal Bell Labs memo, October 1964. Cited in: Dennis M. Richie: “Advice from Doug McIlroy,” cm.bell-labs.com.

[ 12 ] MD McIlroy、EN Pinson 和 BA Tague:“ UNIX 分时系统:前言”, 贝尔系统技术期刊,第 57 卷,第 6 期,第 1899-1904 页,1978 年 7 月。

[12] M. D. McIlroy, E. N. Pinson, and B. A. Tague: “UNIX Time-Sharing System: Foreword,” The Bell System Technical Journal, volume 57, number 6, pages 1899–1904, July 1978.

[ 13 ] Eric S. Raymond: UNIX 编程的艺术。艾迪生·韦斯利,2003。ISBN:978-0-13-142901-7

[13] Eric S. Raymond: The Art of UNIX Programming. Addison-Wesley, 2003. ISBN: 978-0-13-142901-7

[ 14 ] Ronald Duncan:“文本文件格式 – ASCII 分隔文本 – 非 CSV 或 TAB 分隔文本”, ronaldduncan.wordpress.com,2009 年 10 月 31 日。

[14] Ronald Duncan: “Text File Formats – ASCII Delimited Text – Not CSV or TAB Delimited Text,” ronaldduncan.wordpress.com, October 31, 2009.

[ 15 ] Alan Kay:“ ‘软件工程’是一个矛盾吗?,” tinlizzie.org

[15] Alan Kay: “Is ‘Software Engineering’ an Oxymoron?,” tinlizzie.org.

[ 16 ] Martin Fowler:“ InversionOfControl ”, martinfowler.com,2005 年 6 月 26 日。

[16] Martin Fowler: “InversionOfControl,” martinfowler.com, June 26, 2005.

[ 17 ] Daniel J. Bernstein:“套接字的两个文件描述符”,cr.yp.to

[17] Daniel J. Bernstein: “Two File Descriptors for Sockets,” cr.yp.to.

[ 18 ] Rob Pike 和 Dennis M. Ritchie:“分布式系统的 Styx 架构”,贝尔实验室技术期刊,第 4 卷,第 2 期,第 146-152 页,1999 年 4 月。

[18] Rob Pike and Dennis M. Ritchie: “The Styx Architecture for Distributed Systems,” Bell Labs Technical Journal, volume 4, number 2, pages 146–152, April 1999.

[ 19 ] Sanjay Ghemawat、Howard Gobioff 和 Shun-Tak Leung:“ The Google File System ”,第 19 届 ACM 操作系统原理研讨会(SOSP),2003 年 10 月 。doi:10.1145/945445.945450

[19] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: “The Google File System,” at 19th ACM Symposium on Operating Systems Principles (SOSP), October 2003. doi:10.1145/945445.945450

[ 20 ] Michael Ovsiannikov、Silvius Rus、Damian Reeves 等人:“ The Quantcast File System ” , VLDB Endowment 论文集,第 6 卷,第 11 期,第 1092–1101 页,2013 年 8 月 。doi:10.14778/2536222.2536234

[20] Michael Ovsiannikov, Silvius Rus, Damian Reeves, et al.: “The Quantcast File System,” Proceedings of the VLDB Endowment, volume 6, number 11, pages 1092–1101, August 2013. doi:10.14778/2536222.2536234

[ 21 ]“ OpenStack Swift 2.6.1 开发人员文档”,OpenStack 基金会,docs.openstack.org,2016 年 3 月。

[21] “OpenStack Swift 2.6.1 Developer Documentation,” OpenStack Foundation, docs.openstack.org, March 2016.

[ 22 ] 张哲、王安德、郑凯等人:“ Apache Hadoop 中的 HDFS 纠删码简介”,blog.cloudera.com,2015 年 9 月 23 日。

[22] Zhe Zhang, Andrew Wang, Kai Zheng, et al.: “Introduction to HDFS Erasure Coding in Apache Hadoop,” blog.cloudera.com, September 23, 2015.

[ 23 ] Peter Cnudde:“ Hadoop 迎来 10 周年”, yahoohadoop.tumblr.com,2016 年 2 月 5 日。

[23] Peter Cnudde: “Hadoop Turns 10,” yahoohadoop.tumblr.com, February 5, 2016.

[ 24 ] Eric Baldeschwieler:“思考 HDFS 与其他存储技术”,hortonworks.com,2012 年 7 月 25 日。

[24] Eric Baldeschwieler: “Thinking About the HDFS vs. Other Storage Technologies,” hortonworks.com, July 25, 2012.

[ 25 ] Brendan Gregg:“ Manta:Unix 遇上 Map Reduce ”,dtrace.org,2013 年 6 月 25 日。

[25] Brendan Gregg: “Manta: Unix Meets Map Reduce,” dtrace.org, June 25, 2013.

[ 26 ] Tom White:Hadoop:权威指南,第四版。奥莱利媒体,2015。ISBN:978-1-491-90163-2

[26] Tom White: Hadoop: The Definitive Guide, 4th edition. O’Reilly Media, 2015. ISBN: 978-1-491-90163-2

[ 27 ] Jim N. Gray:“分布式计算经济学”,微软研究技术报告 MSR-TR-2003-24,2003 年 3 月。

[27] Jim N. Gray: “Distributed Computing Economics,” Microsoft Research Tech Report MSR-TR-2003-24, March 2003.

[ 28 ] Márton Trencséni:“ Luigi vs Airflow vs Pinball ”, bytepawn.com,2016 年 2 月 6 日。

[28] Márton Trencséni: “Luigi vs Airflow vs Pinball,” bytepawn.com, February 6, 2016.

[ 29 ] Roshan Sumbaly、Jay Kreps 和 Sam Shah:“ LinkedIn 的‘大数据’生态系统”,ACM 国际数据管理会议 (SIGMOD),2013 年 7 月 。doi:10.1145/2463676.2463707

[29] Roshan Sumbaly, Jay Kreps, and Sam Shah: “The ‘Big Data’ Ecosystem at LinkedIn,” at ACM International Conference on Management of Data (SIGMOD), July 2013. doi:10.1145/2463676.2463707

[ 30 ] Alan F. Gates、Olga Natkovich、Shubham Chopra 等人:“在 Map-Reduce 之上构建高级数据流系统:猪的经验”,第 35 届国际超大型数据库会议(VLDB) ,2009 年 8 月。

[30] Alan F. Gates, Olga Natkovich, Shubham Chopra, et al.: “Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience,” at 35th International Conference on Very Large Data Bases (VLDB), August 2009.

[ 31 ] Ashish Thusoo、Joydeep Sen Sarma、Namit Jain 等人:“ Hive – 使用 Hadoop 的 PB 级数据仓库”,第 26 届 IEEE 国际数据工程会议(ICDE),2010 年 3 月。doi:10.1109/ICDE.2010.5447738

[31] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al.: “Hive – A Petabyte Scale Data Warehouse Using Hadoop,” at 26th IEEE International Conference on Data Engineering (ICDE), March 2010. doi:10.1109/ICDE.2010.5447738

[ 32 ]“ Cascading 3.0 用户指南”,Concurrent, Inc.,docs.cascading.org,2016 年 1 月。

[32] “Cascading 3.0 User Guide,” Concurrent, Inc., docs.cascading.org, January 2016.

[ 33 ]“ Apache Crunch 用户指南”,Apache 软件基金会,crunch.apache.org

[33] “Apache Crunch User Guide,” Apache Software Foundation, crunch.apache.org.

[ 34 ] Craig Chambers、Ashish Raniwala、Frances Perry 等人:“ FlumeJava:简单、高效的数据并行管道”,第 31 届 ACM SIGPLAN 编程语言设计和实现会议(PLDI),2010 年 6 月。doi:10.1145/1806596.1806638

[34] Craig Chambers, Ashish Raniwala, Frances Perry, et al.: “FlumeJava: Easy, Efficient Data-Parallel Pipelines,” at 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2010. doi:10.1145/1806596.1806638

[ 35 ] Jay Kreps:“为什么本地状态是流处理中的基本原语”,oreilly.com,2014 年 7 月 31 日。

[35] Jay Kreps: “Why Local State is a Fundamental Primitive in Stream Processing,” oreilly.com, July 31, 2014.

[ 36 ] Martin Kleppmann:“重新思考 Web 应用程序中的缓存”,martin.kleppmann.com,2012 年 10 月 1 日。

[36] Martin Kleppmann: “Rethinking Caching in Web Apps,” martin.kleppmann.com, October 1, 2012.

[ 37 ]Mark Grover、Ted Malaska、Jonathan Seidman 和 Gwen Shapira:Hadoop 应用程序架构。奥莱利媒体,2015 年。ISBN:978-1-491-90004-8

[37] Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira: Hadoop Application Architectures. O’Reilly Media, 2015. ISBN: 978-1-491-90004-8

[ 38 ] Philippe Ajoux、Nathan Bronson、Sanjeev Kumar 等人:“大规模采用更强一致性的挑战”,第 15 届 USENIX 操作系统热门主题研讨会(HotOS),2015 年 5 月。

[38] Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “Challenges to Adopting Stronger Consistency at Scale,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

[ 39 ] Sriranjan Manjunath:“倾斜连接”,wiki.apache.org,2009。

[39] Sriranjan Manjunath: “Skewed Join,” wiki.apache.org, 2009.

[ 40 ] David J. DeWitt、Jeffrey F. Naughton、Donovan A. Schneider 和 S. Seshadri:“并行连接中的实用倾斜处理”,第 18 届超大型数据库国际会议(VLDB),1992 年 8 月。

[40] David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri: “Practical Skew Handling in Parallel Joins,” at 18th International Conference on Very Large Data Bases (VLDB), August 1992.

[ 41 ] Marcel Kornacker、Alexander Behm、Victor Bittorf 等人:“ Impala:用于 Hadoop 的现代开源 SQL 引擎”,第七届创新数据系统研究双年会(CIDR),2015 年 1 月。

[41] Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “Impala: A Modern, Open-Source SQL Engine for Hadoop,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.

[ 42 ]Matthieu Monsch:“开源 PalDB,用于存储辅助数据的轻量级伴侣”,engineering.linkedin.com,2015 年 10 月 26 日。

[42] Matthieu Monsch: “Open-Sourcing PalDB, a Lightweight Companion for Storing Side Data,” engineering.linkedin.com, October 26, 2015.

[ 43 ] Daniel Peng 和 Frank Dabek:“使用分布式事务和通知进行大规模增量处理”,第 9 届 USENIX 操作系统设计和实现(OSDI) 会议,2010 年 10 月。

[43] Daniel Peng and Frank Dabek: “Large-Scale Incremental Processing Using Distributed Transactions and Notifications,” at 9th USENIX conference on Operating Systems Design and Implementation (OSDI), October 2010.

[ 44 ] “Cloudera 搜索用户指南”,Cloudera, Inc.,2015 年 9 月。

[44] “Cloudera Search User Guide,” Cloudera, Inc., September 2015.

[ 45 ] Lili Wu、Sam Shah、Sean Choi 等人:“ The Browsemaps:LinkedIn 的协作过滤”,第 6 届推荐系统和社交网络(RSWeb) 研讨会,2014 年 10 月。

[45] Lili Wu, Sam Shah, Sean Choi, et al.: “The Browsemaps: Collaborative Filtering at LinkedIn,” at 6th Workshop on Recommender Systems and the Social Web (RSWeb), October 2014.

[ 46 ] Roshan Sumbaly、Jay Kreps、Lei Gau 等人:“通过 Project Voldemort 提供大规模批量计算数据”,第 10 届 USENIX 文件和存储技术会议(FAST),2012 年 2 月。

[46] Roshan Sumbaly, Jay Kreps, Lei Gao, et al.: “Serving Large-Scale Batch Computed Data with Project Voldemort,” at 10th USENIX Conference on File and Storage Technologies (FAST), February 2012.

[ 47 ] Varun Sharma:“开源 Terrapin:批量生成数据的服务系统”,engineering.pinterest.com,2015 年 9 月 14 日。

[47] Varun Sharma: “Open-Sourcing Terrapin: A Serving System for Batch Generated Data,” engineering.pinterest.com, September 14, 2015.

[ 48 ] Nathan Marz:“ ElephantDB ”,slideshare.net,2011 年 5 月 30 日。

[48] Nathan Marz: “ElephantDB,” slideshare.net, May 30, 2011.

[ 49 ] Jean-Daniel (JD) Cryans:“操作方法:使用 HBase 批量加载以及原因”,blog.cloudera.com,2013 年 9 月 27 日。

[49] Jean-Daniel (JD) Cryans: “How-to: Use HBase Bulk Loading, and Why,” blog.cloudera.com, September 27, 2013.

[ 50 ] Nathan Marz:“如何击败 CAP 定理”,nathanmarz.com,2011 年 10 月 13 日。

[50] Nathan Marz: “How to Beat the CAP Theorem,” nathanmarz.com, October 13, 2011.

[ 51 ] Molly Bartlett Dishman 和 Martin Fowler:“敏捷架构”,O'Reilly 软件架构会议,2015 年 3 月。

[51] Molly Bartlett Dishman and Martin Fowler: “Agile Architecture,” at O’Reilly Software Architecture Conference, March 2015.

[ 52 ] David J. DeWitt 和 Jim N. Gray:“并行数据库系统:高性能数据库系统的未来”,Communications of the ACM,第 35 卷,第 6 期,第 85–98 页,1992 年 6 月。doi:10.1145/129888.129894

[52] David J. DeWitt and Jim N. Gray: “Parallel Database Systems: The Future of High Performance Database Systems,” Communications of the ACM, volume 35, number 6, pages 85–98, June 1992. doi:10.1145/129888.129894

[ 53 ] Jay Kreps:“但是多租户实际上真的很难”,tweetstorm,twitter.com,2014 年 10 月 31 日。

[53] Jay Kreps: “But the multi-tenancy thing is actually really really hard,” tweetstorm, twitter.com, October 31, 2014.

[ 54 ] Jeffrey Cohen、Brian Dolan、Mark Dunlap 等人:“ MAD 技能:大数据的新分析实践”,VLDB Endowment 会议记录,第 2 卷,第 2 期,第 1481–1492 页,2009 年 8 月。doi:10.14778/1687553.1687576

[54] Jeffrey Cohen, Brian Dolan, Mark Dunlap, et al.: “MAD Skills: New Analysis Practices for Big Data,” Proceedings of the VLDB Endowment, volume 2, number 2, pages 1481–1492, August 2009. doi:10.14778/1687553.1687576

[ 55 ] Ignacio Terrizzano、Peter Schwarz、Mary Roth 和 John E. Colino:“数据争论:从荒野到湖泊的挑战之旅”,第七届创新数据系统研究双年度会议(CIDR),2015 年 1 月。

[55] Ignacio Terrizzano, Peter Schwarz, Mary Roth, and John E. Colino: “Data Wrangling: The Challenging Journey from the Wild to the Lake,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.

[ 56 ] Paige Roberts:“读取模式还是写入模式,这是 Hadoop 数据湖问题”,adaptivesystemsinc.com,2015 年 7 月 2 日。

[56] Paige Roberts: “To Schema on Read or to Schema on Write, That Is the Hadoop Data Lake Question,” adaptivesystemsinc.com, July 2, 2015.

[ 57 ] Bobby Johnson 和 Joseph Adler:“寿司原理:原始数据更好”, Strata+Hadoop World,2015 年 2 月。

[57] Bobby Johnson and Joseph Adler: “The Sushi Principle: Raw Data Is Better,” at Strata+Hadoop World, February 2015.

[ 58 ] Vinod Kumar Vavilapalli、Arun C. Murthy、Chris Douglas 等人:“ Apache Hadoop YARN:又一个资源谈判者”,第 4 届 ACM 云计算研讨会(SoCC),2013 年 10 月 。doi:10.1145/2523616.2523633

[58] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, et al.: “Apache Hadoop YARN: Yet Another Resource Negotiator,” at 4th ACM Symposium on Cloud Computing (SoCC), October 2013. doi:10.1145/2523616.2523633

[ 59 ] Abhishek Verma、Luis Pedrosa、Madhukar Korupolu 等人:“ Large-Scale Cluster Management at Google with Borg ”,第 10 届欧洲计算机系统会议(EuroSys),2015 年 4 月 。doi:10.1145/2741948.2741964

[59] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, et al.: “Large-Scale Cluster Management at Google with Borg,” at 10th European Conference on Computer Systems (EuroSys), April 2015. doi:10.1145/2741948.2741964

[ 60 ] Malte Schwarzkopf:“集群调度器架构的演变”,firmament.io,2016 年 3 月 9 日。

[60] Malte Schwarzkopf: “The Evolution of Cluster Scheduler Architectures,” firmament.io, March 9, 2016.

[ 61 ] Matei Zaharia、Mosharaf Chowdhury、Tathagata Das 等人:“弹性分布式数据集:内存集群计算的容错抽象”,第 9 届 USENIX 网络系统设计与实现(NSDI) 研讨会,2012 年 4 月。

[61] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al.: “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” at 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2012.

[ 62 ] Holden Karau、Andy Konwinski、Patrick Wendell 和 Matei Zaharia:学习 Spark。奥莱利媒体,2015。ISBN:978-1-449-35904-1

[62] Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia: Learning Spark. O’Reilly Media, 2015. ISBN: 978-1-449-35904-1

[ 63 ] Bikas Saha 和 Hitesh Shah:“ Apache Tez:加速 Hadoop 查询处理”,Hadoop 峰会,2014 年 6 月。

[63] Bikas Saha and Hitesh Shah: “Apache Tez: Accelerating Hadoop Query Processing,” at Hadoop Summit, June 2014.

[ 64 ] Bikas Saha、Hitesh Shah、Siddharth Seth 等人:“ Apache Tez:建模和构建数据处理应用程序的统一框架”,ACM 国际数据管理会议(SIGMOD),2015 年 6 月。doi:10.1145/2723372.2742790

[64] Bikas Saha, Hitesh Shah, Siddharth Seth, et al.: “Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2742790

[ 65 ] Kostas Tzoumas:“ Apache Flink:API、运行时和项目路线图”,slideshare.net,2015 年 1 月 14 日。

[65] Kostas Tzoumas: “Apache Flink: API, Runtime, and Project Roadmap,” slideshare.net, January 14, 2015.

[ 66 ] Alexander Alexandrov、Rico Bergmann、Stephan Ewen 等人:“ The Stratosphere Platform for Big Data Analytics ”,VLDB Journal,第 23 卷,第 6 期,第 939–964 页,2014 年 5 月。doi:10.1007/s00778-014-0357-y

[66] Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al.: “The Stratosphere Platform for Big Data Analytics,” The VLDB Journal, volume 23, number 6, pages 939–964, May 2014. doi:10.1007/s00778-014-0357-y

[ 67 ] Michael Isard、Mihai Budiu、Yuan Yu 等人:“ Dryad:来自顺序构建块的分布式数据并行程序”,欧洲计算机系统会议(EuroSys),2007 年 3 月 。doi:10.1145/1272996.1273005

[67] Michael Isard, Mihai Budiu, Yuan Yu, et al.: “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” at European Conference on Computer Systems (EuroSys), March 2007. doi:10.1145/1272996.1273005

[ 68 ] Daniel Warneke 和 Odej Kao:“ Nephele:云端高效并行数据处理”,第二届网格和超级计算机多任务计算(MTAGS) 研讨会,2009 年 11 月 。doi:10.1145/1646468.1646476

[68] Daniel Warneke and Odej Kao: “Nephele: Efficient Parallel Data Processing in the Cloud,” at 2nd Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS), November 2009. doi:10.1145/1646468.1646476

[ 69 ]Lawrence Page、Sergey Brin、Rajeev Motwani 和 Terry Winograd:“ PageRank引文排名:为网络带来秩序”,Stanford InfoLab 技术报告 422,1999 年。

[69] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd: “The PageRank Citation Ranking: Bringing Order to the Web,” Stanford InfoLab Technical Report 422, 1999.

[ 70 ] Leslie G. Valiant:“并行计算的桥接模型”, ACM 通讯,第 33 卷,第 8 期,第 103–111 页,1990 年 8 月 。doi:10.1145/79173.79181

[70] Leslie G. Valiant: “A Bridging Model for Parallel Computation,” Communications of the ACM, volume 33, number 8, pages 103–111, August 1990. doi:10.1145/79173.79181

[ 71 ] Stephan Ewen、Kostas Tzoumas、Moritz Kaufmann 和 Volker Markl:“旋转快速迭代数据流”,VLDB Endowment 论文集,第 5 卷,第 11 期,第 1268-1279 页,2012 年 7 月 。doi:10.14778/2350229.2350245

[71] Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl: “Spinning Fast Iterative Data Flows,” Proceedings of the VLDB Endowment, volume 5, number 11, pages 1268-1279, July 2012. doi:10.14778/2350229.2350245

[ 72 ] Grzegorz Malewicz、Matthew H. Austern、Aart J. C. Bik 等人:“ Pregel:大规模图形处理系统”,ACM 国际数据管理会议(SIGMOD),2010 年 6 月。doi:10.1145/1807167.1807184

[72] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, et al.: “Pregel: A System for Large-Scale Graph Processing,” at ACM International Conference on Management of Data (SIGMOD), June 2010. doi:10.1145/1807167.1807184

[ 73 ] Frank McSherry、Michael Isard 和 Derek G. Murray:“可扩展性!但代价是什么?”, 第 15 届 USENIX 操作系统热门话题研讨会(HotOS),2015 年 5 月。

[73] Frank McSherry, Michael Isard, and Derek G. Murray: “Scalability! But at What COST?,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

[ 74 ] Ionel Gog、Malte Schwarzkopf、Natacha Crooks 等人:“火枪手:数据处理系统中的 All for One,One for All ”,第 10 届欧洲计算机系统会议(EuroSys),2015 年 4 月 。doi:10.1145/ 2741948.2741968

[74] Ionel Gog, Malte Schwarzkopf, Natacha Crooks, et al.: “Musketeer: All for One, One for All in Data Processing Systems,” at 10th European Conference on Computer Systems (EuroSys), April 2015. doi:10.1145/2741948.2741968

[ 75 ] Aapo Kyrola、Guy Blelloch 和 Carlos Guestrin:“ GraphChi:仅在 PC 上进行大规模图形计算”,第 10 届 USENIX 操作系统设计与实现研讨会(OSDI),2012 年 10 月。

[75] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin: “GraphChi: Large-Scale Graph Computation on Just a PC,” at 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2012.

[ 76 ] Andrew Lenharth、Donald Nguyen 和 Keshav Pingali:“并行图分析”,ACM 通讯,第 59 卷,第 5 期,第 78-87 页,2016 年 5 月。doi:10.1145/2901919

[76] Andrew Lenharth, Donald Nguyen, and Keshav Pingali: “Parallel Graph Analytics,” Communications of the ACM, volume 59, number 5, pages 78–87, May 2016. doi:10.1145/2901919

[ 77 ] Fabian Hüske:“窥探 Apache Flink 的引擎室”,flink.apache.org,2015 年 3 月 13 日。

[77] Fabian Hüske: “Peeking into Apache Flink’s Engine Room,” flink.apache.org, March 13, 2015.

[ 78 ] Mostafa Mokhtar:“ Hive 0.14 基于成本的优化器 (CBO) 技术概述”,hortonworks.com,2015 年 3 月 2 日。

[78] Mostafa Mokhtar: “Hive 0.14 Cost Based Optimizer (CBO) Technical Overview,” hortonworks.com, March 2, 2015.

[ 79 ] Michael Armbrust、Reynold S Xin、Cheng Lian 等人:“ Spark SQL:Spark 中的关系数据处理”,ACM 国际数据管理会议(SIGMOD),2015 年 6 月 。doi:10.1145/2723372.2742797

[79] Michael Armbrust, Reynold S Xin, Cheng Lian, et al.: “Spark SQL: Relational Data Processing in Spark,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2742797

[ 80 ] Daniel Blazevski:“为 Apache Flink 种植四叉树”,insightdataengineering.com,2016 年 3 月 25 日。

[80] Daniel Blazevski: “Planting Quadtrees for Apache Flink,” insightdataengineering.com, March 25, 2016.

[ 81 ] Tom White:“基因组分析工具包:现在使用 Apache Spark 进行数据处理”,blog.cloudera.com,2016 年 4 月 6 日。

[81] Tom White: “Genome Analysis Toolkit: Now Using Apache Spark for Data Processing,” blog.cloudera.com, April 6, 2016.

第 11 章流处理

Chapter 11. Stream Processing

一个有效的复杂系统总是由一个有效的简单系统演化而来。相反的命题似乎也是正确的:从头开始设计的复杂系统永远不会工作,也无法使其工作。

约翰·加尔,系统学(1975)

A complex system that works is invariably found to have evolved from a simple system that works. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work.

John Gall, Systemantics (1975)

第 10 章中,我们讨论了批处理——读取一组文件作为输入并生成一组新的输出文件的技术。输出是导出数据的一种形式;也就是说,如果需要,可以通过再次运行批处理来重新创建数据集。我们看到了如何使用这个简单但强大的想法来创建搜索索引、推荐系统、分析等。

In Chapter 10 we discussed batch processing—techniques that read a set of files as input and produce a new set of output files. The output is a form of derived data; that is, a dataset that can be recreated by running the batch process again if necessary. We saw how this simple but powerful idea can be used to create search indexes, recommendation systems, analytics, and more.

然而,第 10 章 中始终存在一个重要假设:即输入是有界的(即,具有已知且有限的大小),因此批处理过程知道何时完成了输入的读取。例如,MapReduce 的核心排序操作必须先读取其整个输入,然后才能开始生成输出:最后一个输入记录可能是具有最低键的记录,因此需要成为第一个输出记录,所以提前开始输出不是一个选择。

However, one big assumption remained throughout Chapter 10: namely, that the input is bounded—i.e., of a known and finite size—so the batch process knows when it has finished reading its input. For example, the sorting operation that is central to MapReduce must read its entire input before it can start producing output: it could happen that the very last input record is the one with the lowest key, and thus needs to be the very first output record, so starting the output early is not an option.

实际上,许多数据是无限的,因为它们随着时间的推移逐渐到达:您的用户昨天和今天生成了数据,他们明天将继续生成更多数据。除非您倒闭,否则这个过程永远不会结束,因此数据集永远不会以任何有意义的方式“完整”[ 1 ]。因此,批处理器必须人为地将数据划分为固定持续时间的块:例如,在每天结束时处理一天的数据,或者在每小时结束时处理一小时的数据。

In reality, a lot of data is unbounded because it arrives gradually over time: your users produced data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of business, this process never ends, and so the dataset is never “complete” in any meaningful way [1]. Thus, batch processors must artificially divide the data into chunks of fixed duration: for example, processing a day’s worth of data at the end of every day, or processing an hour’s worth of data at the end of every hour.
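For illustration, slicing timestamped events into hourly batches might look like the following sketch (timestamps in seconds since the epoch; the event shapes are hypothetical):

```python
from collections import defaultdict

def batch_by_hour(events):
    """Slice an (in principle unbounded) event stream into fixed
    one-hour batches, keyed by the start of each hour."""
    batches = defaultdict(list)
    for ts, payload in events:            # ts: seconds since epoch
        batches[ts - ts % 3600].append(payload)
    return dict(batches)

batch_by_hour([(10, "a"), (3599, "b"), (3600, "c")])
# → {0: ["a", "b"], 3600: ["c"]}
```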

每日批处理的问题在于,输入的变化仅反映在一天后的输出中,这对于许多不耐烦的用户来说太慢了。为了减少延迟,我们可以更频繁地运行处理,例如,在每一秒结束时处理一秒的数据,甚至连续地完全放弃固定时间片,而简单地在每个事件发生时处理它。这就是流处理背后的想法。

The problem with daily batch processes is that changes in the input are only reflected in the output a day later, which is too slow for many impatient users. To reduce the delay, we can run the processing more frequently—say, processing a second’s worth of data at the end of every second—or even continuously, abandoning the fixed time slices entirely and simply processing every event as it happens. That is the idea behind stream processing.

一般来说,“流”是指随着时间的推移逐渐提供的数据。这个概念出现在很多地方:Unix 的 stdin 和 stdout、编程语言中的惰性列表 [ 2 ]、文件系统 API(例如 Java 的 FileInputStream)、TCP 连接、通过互联网传送音频和视频等等。

In general, a “stream” refers to data that is incrementally made available over time. The concept appears in many places: in the stdin and stdout of Unix, programming languages (lazy lists) [2], filesystem APIs (such as Java’s FileInputStream), TCP connections, delivering audio and video over the internet, and so on.
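The lazy-list idea mentioned above can be sketched with a Python generator: values become available incrementally, and the consumer pulls them one at a time rather than reading a complete, bounded dataset up front. This is only an illustration of the concept, not any particular streaming API; the `sensor_readings` name is made up.

```python
# A generator models a stream: events are produced incrementally, and
# the consumer pulls them one at a time instead of materializing the
# whole (potentially unbounded) dataset in advance.
def sensor_readings(values):
    for v in values:
        yield {"type": "temperature", "value": v}

stream = sensor_readings([20.1, 20.3, 19.8])
first = next(stream)        # consume one event without waiting for the rest
remaining = list(stream)    # drain whatever else has arrived so far
```

With an unbounded source, `remaining` would never be computable; a stream consumer instead keeps pulling (or being pushed) events forever.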

在本章中,我们将把事件流视为一种数据管理机制:即上一章所见批处理数据的无界、增量处理的对应物。我们将首先讨论如何在网络上表示、存储和传输流。在“数据库和流”中,我们将研究流和数据库之间的关系。最后,在“处理流”中,我们将探索持续处理这些流的方法和工具,以及用它们构建应用程序的方式。

In this chapter we will look at event streams as a data management mechanism: the unbounded, incrementally processed counterpart to the batch data we saw in the last chapter. We will first discuss how streams are represented, stored, and transmitted over a network. In “Databases and Streams” we will investigate the relationship between streams and databases. And finally, in “Processing Streams” we will explore approaches and tools for processing those streams continually, and ways that they can be used to build applications.

传输事件流

Transmitting Event Streams

在批处理世界中,作业的输入和输出是文件(可能位于分布式文件系统上)。与之等效的流式处理是什么样子的?

In the batch processing world, the inputs and outputs of a job are files (perhaps on a distributed filesystem). What does the streaming equivalent look like?

当输入是文件(字节序列)时,第一个处理步骤通常是将其解析为记录序列。在流处理上下文中,记录通常被称为 事件,但它本质上是同一件事:一个小的、独立的、不可变的对象,包含在某个时间点发生的事情的详细信息。事件通常包含一个时间戳,根据日历时钟指示事件发生的时间(请参阅“单调时钟与日历时钟”)。

When the input is a file (a sequence of bytes), the first processing step is usually to parse it into a sequence of records. In a stream processing context, a record is more commonly known as an event, but it is essentially the same thing: a small, self-contained, immutable object containing the details of something that happened at some point in time. An event usually contains a timestamp indicating when it happened according to a time-of-day clock (see “Monotonic Versus Time-of-Day Clocks”).

例如,发生的事情可能是用户采取的操作,例如查看页面或进行购买。它还可能源自机器,例如温度传感器的定期测量或 CPU 利用率指标。在“使用 Unix 工具进行批处理”的示例中,Web 服务器日志的每一行都是一个事件。

For example, the thing that happened might be an action that a user took, such as viewing a page or making a purchase. It might also originate from a machine, such as a periodic measurement from a temperature sensor, or a CPU utilization metric. In the example of “Batch Processing with Unix Tools”, each line of the web server log is an event.

事件可以被编码为文本字符串、JSON,或者可能是某种二进制形式,如 第 4 章所述。此编码允许您存储事件,例如将其附加到文件、将其插入关系表或将其写入文档数据库。它还允许您通过网络将事件发送到另一个节点以便处理它。

An event may be encoded as a text string, or JSON, or perhaps in some binary form, as discussed in Chapter 4. This encoding allows you to store an event, for example by appending it to a file, inserting it into a relational table, or writing it to a document database. It also allows you to send the event over the network to another node in order to process it.
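As a minimal sketch of the encode-and-append idea, each event can be serialized as one JSON object per line and appended to a log file. The helper names (`append_event`, `read_events`) and the event fields are hypothetical, chosen only for illustration.

```python
import json
import os
import tempfile

# Hypothetical sketch: encode each event as one line of JSON and append
# it to a file. The "ts" field plays the role of the event timestamp.
def append_event(path, event):
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

def read_events(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

log_path = os.path.join(tempfile.mkdtemp(), "events.log")
append_event(log_path, {"event": "page_view", "url": "/home", "ts": 1700000000})
append_event(log_path, {"event": "purchase", "sku": "A1", "ts": 1700000005})
events = read_events(log_path)
```

The same serialized bytes could just as well be inserted into a database table or sent over a TCP connection to another node.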

在批处理中,文件被写入一次,然后可能被多个作业读取。类似地,在流术语中,事件由生产者(也称为发布者或发送者)生成一次,然后可能由多个消费者(订阅者或接收者)处理 [ 3 ]。在文件系统中,文件名标识一组相关记录;在流系统中,相关事件通常被分组到一个主题或流中。

In batch processing, a file is written once and then potentially read by multiple jobs. Analogously, in streaming terminology, an event is generated once by a producer (also known as a publisher or sender), and then potentially processed by multiple consumers (subscribers or recipients) [3]. In a filesystem, a filename identifies a set of related records; in a streaming system, related events are usually grouped together into a topic or stream.

原则上,文件或数据库足以连接生产者和消费者:生产者将其生成的每个事件写入数据存储,每个消费者定期轮询数据存储以检查自上次运行以来出现的事件。这本质上就是批处理在每天结束时处理一天的数据时所做的事情。

In principle, a file or database is sufficient to connect producers and consumers: a producer writes every event that it generates to the datastore, and each consumer periodically polls the datastore to check for events that have appeared since it last ran. This is essentially what a batch process does when it processes a day’s worth of data at the end of every day.

然而,当转向低延迟的连续处理时,如果数据存储不是针对这种用途而设计的,那么轮询就会变得昂贵。轮询越频繁,返回新事件的请求百分比就越低,因此开销就越高。相反,当新事件出现时,消费者最好得到通知。

However, when moving toward continual processing with low delays, polling becomes expensive if the datastore is not designed for this kind of usage. The more often you poll, the lower the percentage of requests that return new events, and thus the higher the overheads become. Instead, it is better for consumers to be notified when new events appear.

传统上,数据库并不能很好地支持这种通知机制:关系数据库通常具有触发器,可以对更改(例如,将一行插入表中)做出反应,但它们的功能非常有限,而且在数据库设计中多少属于事后才添加的功能 [ 4 , 5 ]。相反,人们开发了专门的工具来传递事件通知。

Databases have traditionally not supported this kind of notification mechanism very well: relational databases commonly have triggers, which can react to a change (e.g., a row being inserted into a table), but they are very limited in what they can do and have been somewhat of an afterthought in database design [4, 5]. Instead, specialized tools have been developed for the purpose of delivering event notifications.

消息系统

Messaging Systems

通知消费者有关新事件的常见方法是使用消息传递系统:生产者发送包含该事件的消息,然后将其推送给消费者。我们之前在“消息传递数据流”中接触过这些系统,但现在我们将进行更详细的介绍。

A common approach for notifying consumers about new events is to use a messaging system: a producer sends a message containing the event, which is then pushed to consumers. We touched on these systems previously in “Message-Passing Dataflow”, but we will now go into more detail.

生产者和消费者之间的直接通信通道(如 Unix 管道或 TCP 连接)将是实现消息传递系统的简单方法。然而,大多数消息传递系统都扩展了这个基本模型。特别是,Unix 管道和 TCP 将一个发送者与一个接收者连接起来,而消息系统允许多个生产者节点向同一主题发送消息,并允许多个消费者节点接收同一主题中的消息。

A direct communication channel like a Unix pipe or TCP connection between producer and consumer would be a simple way of implementing a messaging system. However, most messaging systems expand on this basic model. In particular, Unix pipes and TCP connect exactly one sender with one recipient, whereas a messaging system allows multiple producer nodes to send messages to the same topic and allows multiple consumer nodes to receive messages in a topic.

在这种发布/订阅模型中,不同的系统采用多种方法,并且没有一个适合所有目的的正确答案。为了区分这些系统,提出以下两个问题特别有帮助:

Within this publish/subscribe model, different systems take a wide range of approaches, and there is no one right answer for all purposes. To differentiate the systems, it is particularly helpful to ask the following two questions:

  1. 如果生产者发送消息的速度比消费者处理消息的速度快,会发生什么情况?一般来说,有三种选择:系统可以丢弃消息、在队列中缓冲消息,或施加背压(也称为流量控制,即阻止生产者发送更多消息)。例如,Unix 管道和 TCP 使用背压:它们有一个固定大小的小型缓冲区,如果缓冲区已满,发送方将被阻塞,直到接收方从缓冲区中取出数据(请参阅“网络拥塞和排队”)。

    如果消息缓冲在队列中,那么了解队列增长时会发生什么情况非常重要。如果队列不再适合内存,系统是否会崩溃,或者是否将消息写入磁盘?如果是这样,磁盘访问如何影响消息系统的性能[ 6 ]?

  1. What happens if the producers send messages faster than the consumers can process them? Broadly speaking, there are three options: the system can drop messages, buffer messages in a queue, or apply backpressure (also known as flow control; i.e., blocking the producer from sending more messages). For example, Unix pipes and TCP use backpressure: they have a small fixed-size buffer, and if it fills up, the sender is blocked until the recipient takes data out of the buffer (see “Network congestion and queueing”).

    If messages are buffered in a queue, it is important to understand what happens as that queue grows. Does the system crash if the queue no longer fits in memory, or does it write messages to disk? If so, how does the disk access affect the performance of the messaging system [6]?

  2. 如果节点崩溃或暂时离线会发生什么情况——是否有消息丢失?与数据库一样,持久性可能需要写入磁盘和/或复制的某种组合(请参阅侧边栏 “复制和持久性”),这是有成本的。如果您可以承受有时丢失消息的情况,那么您可能可以在相同的硬件上获得更高的吞吐量和更低的延迟。

  2. What happens if nodes crash or temporarily go offline—are any messages lost? As with databases, durability may require some combination of writing to disk and/or replication (see the sidebar “Replication and Durability”), which has a cost. If you can afford to sometimes lose messages, you can probably get higher throughput and lower latency on the same hardware.
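The backpressure option from question 1 can be sketched with Python's bounded `queue.Queue`: when the buffer is full, `put()` blocks the producer until the consumer takes an item out — the same flow-control idea as Unix pipes and TCP. The tiny buffer size here is artificial, chosen only to make the blocking visible.

```python
import queue
import threading

# Backpressure sketch: a 2-slot bounded buffer. put() blocks whenever
# the buffer is full, so a fast producer is throttled to the
# consumer's pace instead of the queue growing without bound.
buf = queue.Queue(maxsize=2)
consumed = []

def consumer():
    while True:
        msg = buf.get()
        if msg is None:        # sentinel: producer has finished
            break
        consumed.append(msg)
        buf.task_done()

t = threading.Thread(target=consumer)
t.start()
for i in range(5):
    buf.put(i)                 # blocks while the 2-slot buffer is full
buf.put(None)
t.join()
```

Dropping messages would correspond to `put_nowait()` with the `queue.Full` exception swallowed; unbounded buffering to `Queue(maxsize=0)`.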

消息丢失是否可以接受很大程度上取决于应用程序。例如,对于定期传输的传感器读数和指标,偶尔丢失的数据点可能并不重要,因为无论如何都会在短时间内发送更新的值。但是,请注意,如果丢弃大量消息,可能不会立即看出指标不正确[ 7 ]。如果您正在对事件进行计数,则更重要的是可靠地传递它们,因为每条丢失的消息都意味着计数器不正确。

Whether message loss is acceptable depends very much on the application. For example, with sensor readings and metrics that are transmitted periodically, an occasional missing data point is perhaps not important, since an updated value will be sent a short time later anyway. However, beware that if a large number of messages are dropped, it may not be immediately apparent that the metrics are incorrect [7]. If you are counting events, it is more important that they are delivered reliably, since every lost message means incorrect counters.

我们在第 10 章 中探讨的批处理系统的一个很好的特性是它们提供了强大的可靠性保证:失败的任务会自动重试,并且失败任务的部分输出会被自动丢弃。这意味着输出与没有发生故障一样,这有助于简化编程模型。在本章后面,我们将研究如何在流上下文中提供类似的保证。

A nice property of the batch processing systems we explored in Chapter 10 is that they provide a strong reliability guarantee: failed tasks are automatically retried, and partial output from failed tasks is automatically discarded. This means the output is the same as if no failures had occurred, which helps simplify the programming model. Later in this chapter we will examine how we can provide similar guarantees in a streaming context.

从生产者到消费者的直接消息传递

Direct messaging from producers to consumers

许多消息系统在生产者和消费者之间使用直接网络通信,而不通过中间节点:

A number of messaging systems use direct network communication between producers and consumers without going via intermediary nodes:

  • UDP 多播广泛应用于金融行业的数据流,例如股票市场行情源,其中低延迟非常重要 [ 8 ]。尽管 UDP 本身不可靠,但应用程序级协议可以恢复丢失的数据包(生产者必须记住它已发送的数据包,以便可以根据需要重新传输它们)。

  • UDP multicast is widely used in the financial industry for streams such as stock market feeds, where low latency is important [8]. Although UDP itself is unreliable, application-level protocols can recover lost packets (the producer must remember packets it has sent so that it can retransmit them on demand).

  • 无代理消息传递库(例如 ZeroMQ [ 9 ] 和 nanomsg)采用类似的方法,通过 TCP 或 IP 多播实现发布/订阅消息传递。

  • Brokerless messaging libraries such as ZeroMQ [9] and nanomsg take a similar approach, implementing publish/subscribe messaging over TCP or IP multicast.

  • StatsD [ 10 ] 和 Brubeck [ 7 ] 使用不可靠的 UDP 消息传递,从网络上的所有机器收集指标并进行监控。(在 StatsD 协议中,只有收到所有消息时计数器指标才是正确的;使用 UDP 意味着指标充其量只是近似值 [ 11 ]。另请参阅“TCP 与 UDP”。)

  • StatsD [10] and Brubeck [7] use unreliable UDP messaging for collecting metrics from all machines on the network and monitoring them. (In the StatsD protocol, counter metrics are only correct if all messages are received; using UDP makes the metrics at best approximate [11]. See also “TCP Versus UDP”.)

  • 如果消费者在网络上公开服务,则生产者可以直接发出 HTTP 或 RPC 请求(请参阅“通过服务的数据流:REST 和 RPC”)以将消息推送给消费者。这就是 webhooks [ 12 ]背后的想法,在这种模式中,一个服务的回调 URL 注册到另一个服务,并且每当事件发生时它都会向该 URL 发出请求。

  • If the consumer exposes a service on the network, producers can make a direct HTTP or RPC request (see “Dataflow Through Services: REST and RPC”) to push messages to the consumer. This is the idea behind webhooks [12], a pattern in which a callback URL of one service is registered with another service, and it makes a request to that URL whenever an event occurs.
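The webhook pattern in the last bullet can be sketched as a registry of callback URLs plus a publish step that "POSTs" each event to every registered URL. To keep the sketch self-contained, `deliver` just records the outgoing calls instead of making real HTTP requests; the URL and event fields are made up.

```python
# Webhook sketch: one service registers a callback URL with another;
# whenever an event occurs, the publisher makes a request to that URL.
registry = []       # callback URLs registered by consumer services
deliveries = []     # stand-in for outgoing HTTP POST requests

def register(callback_url):
    registry.append(callback_url)

def publish(event):
    for url in registry:
        # Real code would POST the event as JSON to `url` here,
        # typically with retries on failure.
        deliveries.append((url, event))

register("https://example.net/hooks/orders")   # hypothetical URL
publish({"event": "order_created", "id": 42})
```

Note the limitation described next in the text: if the publisher crashes while retrying a failed delivery, its buffer of undelivered events is lost.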

尽管这些直接消息传递系统在其设计的情况下运行良好,但它们通常要求应用程序代码了解消息丢失的可能性。它们可以容忍的故障非常有限:即使协议检测并重新传输网络中丢失的数据包,它们通常也假设生产者和消费者始终在线。

Although these direct messaging systems work well in the situations for which they are designed, they generally require the application code to be aware of the possibility of message loss. The faults they can tolerate are quite limited: even if the protocols detect and retransmit packets that are lost in the network, they generally assume that producers and consumers are constantly online.

如果消费者离线,它可能会错过在其不可达期间发送的消息。某些协议允许生产者重试失败的消息传递,但如果生产者崩溃,丢失了本应重试的消息缓冲区,这种方法就可能失效。

If a consumer is offline, it may miss messages that were sent while it is unreachable. Some protocols allow the producer to retry failed message deliveries, but this approach may break down if the producer crashes, losing the buffer of messages that it was supposed to retry.

消息代理

Message brokers

一种广泛使用的替代方案是通过消息代理(也称为消息队列) 发送消息,消息代理本质上是一种针对处理消息流进行优化的数据库[ 13 ]。它作为服务器运行,生产者和消费者作为客户端连接到它。生产者将消息写入代理,消费者通过从代理读取消息来接收消息。

A widely used alternative is to send messages via a message broker (also known as a message queue), which is essentially a kind of database that is optimized for handling message streams [13]. It runs as a server, with producers and consumers connecting to it as clients. Producers write messages to the broker, and consumers receive them by reading them from the broker.

通过将数据集中在代理中,这些系统可以更轻松地容忍来来往往的客户端(连接、断开连接和崩溃),而持久性问题则转移到了代理身上。一些消息代理仅将消息保存在内存中,而另一些消息代理(取决于配置)会将消息写入磁盘,以便在代理崩溃时不会丢失消息。面对缓慢的消费者,它们通常允许无界排队(而不是丢弃消息或施加背压),尽管这种选择也可能取决于配置。

By centralizing the data in the broker, these systems can more easily tolerate clients that come and go (connect, disconnect, and crash), and the question of durability is moved to the broker instead. Some message brokers only keep messages in memory, while others (depending on configuration) write them to disk so that they are not lost in case of a broker crash. Faced with slow consumers, they generally allow unbounded queueing (as opposed to dropping messages or backpressure), although this choice may also depend on the configuration.

排队的一个后果是消费者通常是异步的:当生产者发送消息时,它通常只等待代理确认它已缓冲该消息,而不会等待消费者处理该消息。向消费者的交付将在某个不确定的未来时间点发生——通常在不到一秒的时间内,但如果存在队列积压,有时会晚得多。

A consequence of queueing is also that consumers are generally asynchronous: when a producer sends a message, it normally only waits for the broker to confirm that it has buffered the message and does not wait for the message to be processed by consumers. The delivery to consumers will happen at some undetermined future point in time—often within a fraction of a second, but sometimes significantly later if there is a queue backlog.

消息代理与数据库的比较

Message brokers compared to databases

一些消息代理甚至可以使用 XA 或 JTA 参与两阶段提交协议(请参阅 “实践中的分布式事务”)。此功能使它们在本质上与数据库非常相似,尽管消息代理和数据库之间仍然存在重要的实际差异:

Some message brokers can even participate in two-phase commit protocols using XA or JTA (see “Distributed Transactions in Practice”). This feature makes them quite similar in nature to databases, although there are still important practical differences between message brokers and databases:

  • 数据库通常会保留数据,直到显式删除为止,而大多数消息代理会在消息成功传递给其使用者后自动删除消息。此类消息代理不适合长期数据存储。

  • Databases usually keep data until it is explicitly deleted, whereas most message brokers automatically delete a message when it has been successfully delivered to its consumers. Such message brokers are not suitable for long-term data storage.

  • 由于它们会快速删除消息,因此大多数消息代理都假设其工作集相当小,即队列很短。如果代理因为消费者速度慢而需要缓冲大量消息(如果消息不再适合内存,则可能将其溢出到磁盘),每条消息都需要更长的时间来处理,并且整体吞吐量可能会降低 [ 6 ]。

  • Since they quickly delete messages, most message brokers assume that their working set is fairly small—i.e., the queues are short. If the broker needs to buffer a lot of messages because the consumers are slow (perhaps spilling messages to disk if they no longer fit in memory), each individual message takes longer to process, and the overall throughput may degrade [6].

  • 数据库通常支持二级索引和各种搜索数据的方式,而消息代理通常支持订阅与某种模式匹配的主题子集的某种方式。机制不同,但本质上都是客户端选择其想要了解的数据部分的方式。

  • Databases often support secondary indexes and various ways of searching for data, while message brokers often support some way of subscribing to a subset of topics matching some pattern. The mechanisms are different, but both are essentially ways for a client to select the portion of the data that it wants to know about.

  • 查询数据库时,结果通常基于数据的时间点快照;如果另一个客户端随后向数据库写入更改查询结果的内容,则第一个客户端不会发现其先前的结果现已过时(除非它重复查询或轮询更改)。相比之下,消息代理不支持任意查询,但它们会在数据更改时(即,当新消息可用时)通知客户端。

  • When querying a database, the result is typically based on a point-in-time snapshot of the data; if another client subsequently writes something to the database that changes the query result, the first client does not find out that its prior result is now outdated (unless it repeats the query, or polls for changes). By contrast, message brokers do not support arbitrary queries, but they do notify clients when data changes (i.e., when new messages become available).

这是消息代理的传统视图,它体现在 JMS [ 14 ] 和 AMQP [ 15 ] 等标准中,并在 RabbitMQ、ActiveMQ、HornetQ、Qpid、TIBCO Enterprise Message Service、IBM MQ、Azure Service Bus 和 Google Cloud Pub/Sub 等软件中实现 [ 16 ]。

This is the traditional view of message brokers, which is encapsulated in standards like JMS [14] and AMQP [15] and implemented in software like RabbitMQ, ActiveMQ, HornetQ, Qpid, TIBCO Enterprise Message Service, IBM MQ, Azure Service Bus, and Google Cloud Pub/Sub [16].

多个消费者

Multiple consumers

当多个消费者读取同一主题中的消息时,会使用两种主要的消息传递模式,如图11-1所示:

When multiple consumers read messages in the same topic, two main patterns of messaging are used, as illustrated in Figure 11-1:

负载均衡
Load balancing

每条消息都会传递给其中一个消费者,因此消费者可以分担处理主题中的消息的工作。代理可以任意将消息分配给消费者。当消息的处理成本很高,因此您希望能够添加消费者来并行处理时,此模式非常有用。(在 AMQP 中,您可以通过让多个客户端从同一个队列消费来实现负载均衡,在 JMS 中这称为共享订阅。)

Each message is delivered to one of the consumers, so the consumers can share the work of processing the messages in the topic. The broker may assign messages to consumers arbitrarily. This pattern is useful when the messages are expensive to process, and so you want to be able to add consumers to parallelize the processing. (In AMQP, you can implement load balancing by having multiple clients consuming from the same queue, and in JMS it is called a shared subscription.)

扇出
Fan-out

每条消息都会传递给所有消费者。扇出允许多个独立的消费者分别“收听”相同的消息广播,而不会相互影响——流式传输相当于有多个不同的批处理作业读取相同的输入文件。(此功能由 JMS 中的主题订阅和 AMQP 中的交换绑定提供。)

Each message is delivered to all of the consumers. Fan-out allows several independent consumers to each “tune in” to the same broadcast of messages, without affecting each other—the streaming equivalent of having several different batch jobs that read the same input file. (This feature is provided by topic subscriptions in JMS, and exchange bindings in AMQP.)

图 11-1。(a) 负载均衡:在消费者之间分担消费某个主题的工作;(b) 扇出:将每条消息传递给多个消费者。

Figure 11-1. (a) Load balancing: sharing the work of consuming a topic among consumers; (b) fan-out: delivering each message to multiple consumers.

这两种模式可以组合:例如,两个独立的消费者组可以各自订阅一个主题,使得每个组共同接收所有消息,但在每个组内只有一个节点接收每条消息。

The two patterns can be combined: for example, two separate groups of consumers may each subscribe to a topic, such that each group collectively receives all messages, but within each group only one of the nodes receives each message.
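The combination described above — fan-out across consumer groups, load balancing within each group — can be sketched as follows. The group names and round-robin assignment are illustrative; real brokers assign messages (or, in log-based systems, whole partitions) in their own ways.

```python
import itertools

# Fan-out + load balancing sketch: every group receives all messages
# (fan-out), but within a group each message goes to exactly one
# member (load balancing, here via simple round-robin).
groups = {"analytics": ["a1", "a2"], "email": ["e1"]}
assignments = []   # (group, member, message) deliveries

rotation = {g: itertools.cycle(members) for g, members in groups.items()}

def publish(msg):
    for group in groups:
        member = next(rotation[group])
        assignments.append((group, member, msg))

publish("m1")
publish("m2")
```

Each group collectively sees both m1 and m2, but within the two-member "analytics" group the messages are split between a1 and a2.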

确认与重新投递

Acknowledgments and redelivery

消费者可能随时崩溃,因此可能会发生代理向消费者发送消息但消费者从未处理该消息,或者在崩溃之前仅部分处理该消息的情况。为了确保消息不丢失,消息代理使用确认:客户端必须显式告诉代理何时完成消息处理,以便代理可以将其从队列中删除。

Consumers may crash at any time, so it could happen that a broker delivers a message to a consumer but the consumer never processes it, or only partially processes it before crashing. In order to ensure that the message is not lost, message brokers use acknowledgments: a client must explicitly tell the broker when it has finished processing a message so that the broker can remove it from the queue.

如果与客户端的连接关闭或超时,而代理未收到确认,则它会假定该消息未得到处理,因此它会再次将该消息传递给另一个消费者。(请注意,可能会发生消息实际上完全处理,但确认在网络中丢失的情况。处理这种情况需要原子提交协议,如“分布式事务实践”中所述。)

If the connection to a client is closed or times out without the broker receiving an acknowledgment, it assumes that the message was not processed, and therefore it delivers the message again to another consumer. (Note that it could happen that the message actually was fully processed, but the acknowledgment was lost in the network. Handling this case requires an atomic commit protocol, as discussed in “Distributed Transactions in Practice”.)
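The acknowledge-or-redeliver behavior can be sketched with a toy broker that parks each delivered message in an "unacknowledged" set until the consumer acks it; if the connection is lost first, the message goes back on the queue for another consumer. `TinyBroker` and its methods are invented for this sketch, not any real broker's API.

```python
from collections import deque

class TinyBroker:
    """Toy broker: keeps a message until it is explicitly acknowledged."""
    def __init__(self):
        self.queue = deque()
        self.unacked = {}      # delivery_id -> message awaiting ack
        self.next_id = 0

    def send(self, msg):
        self.queue.append(msg)

    def deliver(self):
        msg = self.queue.popleft()
        self.next_id += 1
        self.unacked[self.next_id] = msg
        return self.next_id, msg

    def ack(self, delivery_id):
        del self.unacked[delivery_id]   # safe to forget the message now

    def connection_lost(self, delivery_id):
        # No ack arrived: assume unprocessed, requeue for redelivery.
        self.queue.appendleft(self.unacked.pop(delivery_id))

broker = TinyBroker()
broker.send("m1")
d1, msg = broker.deliver()
broker.connection_lost(d1)        # consumer crashed before acking
d2, redelivered = broker.deliver()
broker.ack(d2)
```

As the text notes, a lost ack makes the broker redeliver a message that was in fact fully processed — which is why this mechanism alone gives at-least-once, not exactly-once, delivery.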

当与负载平衡结合使用时,这种重新传递行为会对消息的排序产生有趣的影响。在图11-2中,消费者通常按照生产者发送消息的顺序处理消息。然而,消费者 2 在处理消息m3时崩溃,同时消费者 1 也在处理消息m4。未确认的消息m3随后被重新传递给消费者 1,结果消费者 1 按m4m3m5的顺序处理消息。因此,m3m4的交付顺序与生产者 1 发送的顺序不同。

When combined with load balancing, this redelivery behavior has an interesting effect on the ordering of messages. In Figure 11-2, the consumers generally process messages in the order they were sent by producers. However, consumer 2 crashes while processing message m3, at the same time as consumer 1 is processing message m4. The unacknowledged message m3 is subsequently redelivered to consumer 1, with the result that consumer 1 processes messages in the order m4, m3, m5. Thus, m3 and m4 are not delivered in the same order as they were sent by producer 1.

图 11-2。消费者 2 在处理 m3 时崩溃,因此稍后将其重新传递给消费者 1。

Figure 11-2. Consumer 2 crashes while processing m3, so it is redelivered to consumer 1 at a later time.

即使消息代理尝试以其他方式保留消息的顺序(按照 JMS 和 AMQP 标准的要求),负载平衡与重新传递的组合也不可避免地会导致消息重新排序。为了避免这个问题,您可以为每个消费者使用单独的队列(即不使用负载平衡功能)。如果消息彼此完全独立,则消息重新排序不是问题,但如果消息之间存在因果依赖性,则消息重新排序可能很重要,正如我们将在本章后面看到的那样。

Even if the message broker otherwise tries to preserve the order of messages (as required by both the JMS and AMQP standards), the combination of load balancing with redelivery inevitably leads to messages being reordered. To avoid this issue, you can use a separate queue per consumer (i.e., not use the load balancing feature). Message reordering is not a problem if messages are completely independent of each other, but it can be important if there are causal dependencies between messages, as we shall see later in the chapter.

分区日志

Partitioned Logs

通过网络发送数据包或向网络服务发出请求通常是暂时性操作,不会留下永久痕迹。尽管可以永久记录它(使用数据包捕获和日志记录),但我们通常不这么认为。即使是持久地将消息写入磁盘的消息代理,在将消息传递给消费者后也会快速删除它们,因为它们是围绕瞬态消息传递思维构建的。

Sending a packet over a network or making a request to a network service is normally a transient operation that leaves no permanent trace. Although it is possible to record it permanently (using packet capture and logging), we normally don’t think of it that way. Even message brokers that durably write messages to disk quickly delete them again after they have been delivered to consumers, because they are built around a transient messaging mindset.

数据库和文件系统采用相反的方法:写入数据库或文件的所有内容通常都应该被永久记录,至少直到有人明确选择再次删除它为止。

Databases and filesystems take the opposite approach: everything that is written to a database or file is normally expected to be permanently recorded, at least until someone explicitly chooses to delete it again.

这种思维方式的差异对派生数据的创建方式有很大影响。正如第 10 章所讨论的,批处理的一个关键特征是您可以重复运行它们,试验处理步骤,而不会损坏输入(因为输入是只读的)。AMQP/JMS 风格消息传递的情况并非如此:如果确认导致消息从代理中删除,则接收消息具有破坏性,因此您无法再次运行相同的使用者并期望获得相同的结果。

This difference in mindset has a big impact on how derived data is created. A key feature of batch processes, as discussed in Chapter 10, is that you can run them repeatedly, experimenting with the processing steps, without risk of damaging the input (since the input is read-only). This is not the case with AMQP/JMS-style messaging: receiving a message is destructive if the acknowledgment causes it to be deleted from the broker, so you cannot run the same consumer again and expect to get the same result.

如果您向消息传递系统添加新的消费者,它通常只会开始接收在其注册之后发送的消息;之前的任何消息都已消失并且无法恢复。与文件和数据库相比,您可以随时添加新客户端,并且它可以读取过去任意写入的数据(只要它没有被应用程序显式覆盖或删除)。

If you add a new consumer to a messaging system, it typically only starts receiving messages sent after the time it was registered; any prior messages are already gone and cannot be recovered. Contrast this with files and databases, where you can add a new client at any time, and it can read data written arbitrarily far in the past (as long as it has not been explicitly overwritten or deleted by the application).

为什么我们不能将数据库的持久存储方法与消息传递的低延迟通知设施结合起来?这就是基于日志的消息代理背后的想法。

Why can we not have a hybrid, combining the durable storage approach of databases with the low-latency notification facilities of messaging? This is the idea behind log-based message brokers.

使用日志进行消息存储

Using logs for message storage

日志只是磁盘上一种仅支持追加的记录序列。我们之前在第 3 章的日志结构存储引擎和预写日志的上下文中讨论过日志,并在第 5 章的复制上下文中再次讨论过它。

A log is simply an append-only sequence of records on disk. We previously discussed logs in the context of log-structured storage engines and write-ahead logs in Chapter 3, and in the context of replication in Chapter 5.

可以使用相同的结构来实现消息代理:生产者通过将消息附加到日志末尾来发送消息,而消费者通过顺序读取日志来接收消息。如果消费者到达日志末尾,它将等待已附加新消息的通知。Unix 工具tail -f监视文件中是否有附加数据,其工作原理基本上是这样的。

The same structure can be used to implement a message broker: a producer sends a message by appending it to the end of the log, and a consumer receives messages by reading the log sequentially. If a consumer reaches the end of the log, it waits for a notification that a new message has been appended. The Unix tool tail -f, which watches a file for data being appended, essentially works like this.
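A minimal sketch of this structure: producers append to a log, and each consumer tracks its own read position. Crucially, reading does not delete anything, so several consumers can scan the same log independently. The in-memory list stands in for the on-disk file; the function names are made up.

```python
# Log-based broker in miniature: an append-only sequence of messages.
log = []

def produce(msg):
    log.append(msg)
    return len(log) - 1            # offset of the appended message

def consume(offset):
    """Return (messages, new_offset) for everything at/after offset."""
    return log[offset:], len(log)

produce("m0")
produce("m1")
msgs_a, off_a = consume(0)         # consumer A reads from the beginning
produce("m2")
msgs_a2, off_a = consume(off_a)    # A picks up only what was appended since
msgs_b, _ = consume(0)             # consumer B independently reads it all
```

When a consumer reaches the end of the log, a real broker would block or notify it when something new is appended, much like `tail -f`.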

为了扩展到比单个磁盘所能提供的更高的吞吐量,可以对日志进行分区(在第 6 章 的意义上)。然后,不同的分区可以托管在不同的计算机上,使每个分区成为一个单独的日志,可以独立于其他分区进行读写。然后,一个主题可以被定义为一组携带相同类型消息的分区。这种方法如图 11-3所示。

In order to scale to higher throughput than a single disk can offer, the log can be partitioned (in the sense of Chapter 6). Different partitions can then be hosted on different machines, making each partition a separate log that can be read and written independently from other partitions. A topic can then be defined as a group of partitions that all carry messages of the same type. This approach is illustrated in Figure 11-3.

在每个分区内,代理为每条消息分配一个单调递增的序列号或偏移量(在图 11-3中,框中的数字是消息偏移量)。这样的序列号是有意义的,因为分区是仅附加的,因此分区内的消息是完全排序的。不同分区之间不存在顺序保证。

Within each partition, the broker assigns a monotonically increasing sequence number, or offset, to every message (in Figure 11-3, the numbers in boxes are message offsets). Such a sequence number makes sense because a partition is append-only, so the messages within a partition are totally ordered. There is no ordering guarantee across different partitions.

图 11-3。生产者通过将消息附加到主题分区文件来发送消息,而消费者则按顺序读取这些文件。

Figure 11-3. Producers send messages by appending them to a topic-partition file, and consumers read these files sequentially.

Apache Kafka [ 17 , 18 ]、Amazon Kinesis Streams [ 19 ] 和 Twitter 的 DistributedLog [ 20 , 21 ] 都是以这种方式工作的基于日志的消息代理。Google Cloud Pub/Sub 在架构上与之类似,但公开的是 JMS 风格的 API,而不是日志抽象 [ 16 ]。即使这些消息代理将所有消息写入磁盘,它们也能够通过跨多台机器分区来实现每秒数百万条消息的吞吐量,并通过复制消息来实现容错 [ 22 , 23 ]。

Apache Kafka [17, 18], Amazon Kinesis Streams [19], and Twitter’s DistributedLog [20, 21] are log-based message brokers that work like this. Google Cloud Pub/Sub is architecturally similar but exposes a JMS-style API rather than a log abstraction [16]. Even though these message brokers write all messages to disk, they are able to achieve throughput of millions of messages per second by partitioning across multiple machines, and fault tolerance by replicating messages [22, 23].

日志与传统消息传递的比较

Logs compared to traditional messaging

基于日志的方法天然支持扇出消息传递,因为多个消费者可以独立读取日志而不会相互影响——读取消息不会将其从日志中删除。为了实现一组消费者之间的负载均衡,代理可以将整个分区分配给消费者组中的节点,而不是将单个消息分配给消费者客户端。

The log-based approach trivially supports fan-out messaging, because several consumers can independently read the log without affecting each other—reading a message does not delete it from the log. To achieve load balancing across a group of consumers, instead of assigning individual messages to consumer clients, the broker can assign entire partitions to nodes in the consumer group.

然后,每个客户端都会消耗分配给它的分区中的所有消息。通常,当消费者被分配了一个日志分区时,它会以简单的单线程方式顺序读取分区中的消息。这种粗粒度的负载平衡方法有一些缺点:

Each client then consumes all the messages in the partitions it has been assigned. Typically, when a consumer has been assigned a log partition, it reads the messages in the partition sequentially, in a straightforward single-threaded manner. This coarse-grained load balancing approach has some downsides:

  • 分担消费主题工作的节点数量最多可以是该主题中的日志分区的数量,因为同一分区内的消息被传递到同一节点。

  • The number of nodes sharing the work of consuming a topic can be at most the number of log partitions in that topic, because messages within the same partition are delivered to the same node.

  • 如果单个消息的处理速度很慢,它就会阻碍该分区中后续消息的处理(一种队头阻塞的形式;请参阅“描述性能”)。

  • If a single message is slow to process, it holds up the processing of subsequent messages in that partition (a form of head-of-line blocking; see “Describing Performance”).

因此,在消息处理成本可能很高、您希望逐条消息地并行化处理、而消息顺序又不那么重要的情况下,JMS/AMQP 风格的消息代理更为可取。另一方面,在消息吞吐量高、每条消息处理速度快、且消息顺序很重要的情况下,基于日志的方法表现非常出色。

Thus, in situations where messages may be expensive to process and you want to parallelize processing on a message-by-message basis, and where message ordering is not so important, the JMS/AMQP style of message broker is preferable. On the other hand, in situations with high message throughput, where each message is fast to process and where message ordering is important, the log-based approach works very well.

消费者偏移量

Consumer offsets

按顺序消费分区使得判断哪些消息已被处理变得容易:偏移量小于消费者当前偏移量的所有消息都已被处理,而偏移量更大的所有消息尚未被看到。因此,代理不需要跟踪每条消息的确认——它只需要定期记录消费者偏移量。这种方法减少了簿记开销,并带来了批处理和流水线化的机会,有助于提高基于日志的系统的吞吐量。

Consuming a partition sequentially makes it easy to tell which messages have been processed: all messages with an offset less than a consumer’s current offset have already been processed, and all messages with a greater offset have not yet been seen. Thus, the broker does not need to track acknowledgments for every single message—it only needs to periodically record the consumer offsets. The reduced bookkeeping overhead and the opportunities for batching and pipelining in this approach help increase the throughput of log-based systems.

事实上,这个偏移量与单领导者数据库复制中常见的日志序列号非常相似,我们在“设置新追随者”中讨论了这一点。在数据库复制中,日志序列号允许追随者在断开连接后重新连接到领导者,并恢复复制而不跳过任何写入。这里使用了完全相同的原理:消息代理的行为就像领导者数据库,而消费者就像追随者。

This offset is in fact very similar to the log sequence number that is commonly found in single-leader database replication, and which we discussed in “Setting Up New Followers”. In database replication, the log sequence number allows a follower to reconnect to a leader after it has become disconnected, and resume replication without skipping any writes. Exactly the same principle is used here: the message broker behaves like a leader database, and the consumer like a follower.

如果消费者节点发生故障,消费者组中的另一个节点将被分配到发生故障的消费者的分区,并开始在最后记录的偏移处消费消息。如果消费者已经处理了后续消息但尚未记录其偏移量,则这些消息将在重新启动时被第二次处理。我们将在本章后面讨论处理这个问题的方法。

If a consumer node fails, another node in the consumer group is assigned the failed consumer’s partitions, and it starts consuming messages at the last recorded offset. If the consumer had processed subsequent messages but not yet recorded their offset, those messages will be processed a second time upon restart. We will discuss ways of dealing with this issue later in the chapter.
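The reprocessing behavior described above can be sketched with a consumer that checkpoints its offset only periodically: if it crashes after processing some messages but before recording its offset, those messages are processed a second time on restart (at-least-once semantics). The checkpoint interval and crash point are contrived for the illustration.

```python
# Offset-checkpoint sketch: the consumer records its offset only every
# 2 messages, so a crash between checkpoints causes reprocessing.
log = ["m0", "m1", "m2", "m3"]
checkpoint = {"offset": 0}     # durable record of consumer progress
processed = []

def run(crash_after=None):
    offset = checkpoint["offset"]
    while offset < len(log):
        processed.append(log[offset])
        offset += 1
        if offset % 2 == 0:                  # periodic checkpoint
            checkpoint["offset"] = offset
        if crash_after is not None and offset == crash_after:
            return                           # simulate crash: no final checkpoint

run(crash_after=3)   # processes m0..m2, but checkpoint only covers m0, m1
run()                # restart from checkpoint: m2 is processed again
```

Note that m2 appears twice in `processed`: exactly the duplicate-processing effect the text describes, which downstream consumers must either tolerate or deduplicate.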

磁盘空间使用情况

Disk space usage

如果您只追加到日志中,您最终将耗尽磁盘空间。为了回收磁盘空间,日志实际上被分成段,并且不时地删除旧段或将其移动到归档存储。(稍后我们将讨论释放磁盘空间的更复杂的方法。)

If you only ever append to the log, you will eventually run out of disk space. To reclaim disk space, the log is actually divided into segments, and from time to time old segments are deleted or moved to archive storage. (We’ll discuss a more sophisticated way of freeing disk space later.)
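The segment-based reclamation just described can be sketched as follows: the log is split into fixed-size segments, and whole old segments are dropped once a retention limit is exceeded — much cheaper than deleting individual messages. The segment size and retention limit are tiny made-up values so the behavior is visible.

```python
# Segment-retention sketch: append-only log split into segments, with
# the oldest segment dropped when the retention limit is exceeded.
SEGMENT_SIZE = 3     # messages per segment (illustrative)
MAX_SEGMENTS = 2     # retention limit (illustrative)
segments = [[]]      # list of segments, each a list of messages

def append(msg):
    if len(segments[-1]) == SEGMENT_SIZE:
        segments.append([])            # roll over to a fresh segment
        if len(segments) > MAX_SEGMENTS:
            segments.pop(0)            # drop the oldest segment wholesale
    segments[-1].append(msg)

for i in range(8):
    append(f"m{i}")
retained = [m for seg in segments for m in seg]
```

After eight appends, only the most recent messages survive; a consumer whose offset points into the dropped segment would miss m0 through m2 entirely.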

这意味着,如果一个缓慢的消费者无法跟上消息的传输速度,并且它落后得太远以至于其消费者偏移量指向已删除的段,那么它将错过一些消息。实际上,日志实现了一个有界大小的缓冲区,当它变满时会丢弃旧消息,也称为循环缓冲区环形缓冲区。但是,由于该缓冲区位于磁盘上,因此它可能非常大。

This means that if a slow consumer cannot keep up with the rate of messages, and it falls so far behind that its consumer offset points to a deleted segment, it will miss some of the messages. Effectively, the log implements a bounded-size buffer that discards old messages when it gets full, also known as a circular buffer or ring buffer. However, since that buffer is on disk, it can be quite large.

让我们做一个粗略的计算。在撰写本文时,典型的大型硬盘容量为 6 TB,顺序写入吞吐量为 150 MB/s。如果您以尽可能快的速度写入消息,大约需要 11 个小时才能填满驱动器。因此,磁盘可以缓冲 11 小时的消息,之后它将开始覆盖旧消息。即使您使用许多硬盘驱动器和机器,该比率也保持不变。实际上,部署很少使用磁盘的完整写入带宽,因此日志通常可以保留几天甚至几周的消息的缓冲区。

Let’s do a back-of-the-envelope calculation. At the time of writing, a typical large hard drive has a capacity of 6 TB and a sequential write throughput of 150 MB/s. If you are writing messages at the fastest possible rate, it takes about 11 hours to fill the drive. Thus, the disk can buffer 11 hours’ worth of messages, after which it will start overwriting old messages. This ratio remains the same, even if you use many hard drives and machines. In practice, deployments rarely use the full write bandwidth of the disk, so the log can typically keep a buffer of several days’ or even weeks’ worth of messages.
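The back-of-the-envelope figure from the paragraph above works out as follows (using decimal TB and MB, as the text does):

```python
# How long does it take to fill a 6 TB drive when writing
# sequentially at 150 MB/s?
capacity_bytes = 6 * 10**12        # 6 TB
write_rate = 150 * 10**6           # 150 MB/s
seconds_to_fill = capacity_bytes / write_rate
hours_to_fill = seconds_to_fill / 3600   # ~11.1 hours
```

So at the drive's maximum sequential rate, the buffer holds roughly 11 hours of messages before old segments must be overwritten.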

无论保留消息多久,日志的吞吐量或多或少保持不变,因为每条消息都会写入磁盘[ 18 ]。这种行为与默认情况下将消息保存在内存中并且仅在队列变得太大时才将其写入磁盘的消息传递系统形成对比:此类系统在队列很短时速度很快,而在开始写入磁盘时变得慢得多,因此吞吐量取决于保留的历史记录量。

Regardless of how long you retain messages, the throughput of a log remains more or less constant, since every message is written to disk anyway [18]. This behavior is in contrast to messaging systems that keep messages in memory by default and only write them to disk if the queue grows too large: such systems are fast when queues are short and become much slower when they start writing to disk, so the throughput depends on the amount of history retained.

当消费者跟不上生产者时

When consumers cannot keep up with producers

在“消息系统” 的开头,我们讨论了如果消费者无法跟上生产者发送消息的速率时该怎么做的三种选择:丢弃消息、缓冲或应用背压。在此分类中,基于日志的方法是一种具有较大但固定大小的缓冲区(受可用磁盘空间限制)的缓冲形式。

At the beginning of “Messaging Systems” we discussed three choices of what to do if a consumer cannot keep up with the rate at which producers are sending messages: dropping messages, buffering, or applying backpressure. In this taxonomy, the log-based approach is a form of buffering with a large but fixed-size buffer (limited by the available disk space).

如果消费者落后得太远,以至于它需要的消息比磁盘上保留的消息更旧,那么它将无法读取这些消息,因此代理会有效地丢弃回溯到缓冲区大小无法容纳的旧消息。您可以监控消费者落后于日志头部的距离,并在明显落后时发出警报。由于缓冲区很大,操作员有足够的时间来修复缓慢的消费者,并允许它在开始丢失消息之前赶上。

If a consumer falls so far behind that the messages it requires are older than what is retained on disk, it will not be able to read those messages—so the broker effectively drops old messages that go back further than the size of the buffer can accommodate. You can monitor how far a consumer is behind the head of the log, and raise an alert if it falls behind significantly. As the buffer is large, there is enough time for a human operator to fix the slow consumer and allow it to catch up before it starts missing messages.
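A minimal sketch of such monitoring follows. Real deployments would read these numbers from the broker's metrics API; the data structures and threshold here are purely illustrative:

```python
# Consumer-lag monitoring sketch: compare each consumer's offset with the
# head of the log, and alert when the gap exceeds a threshold. Tune the
# threshold to how much headroom the on-disk buffer actually provides.

log_head_offset = 1_000_000
consumer_offsets = {"search-indexer": 999_950, "analytics": 400_000}

ALERT_THRESHOLD = 100_000

def check_lag():
    """Return (consumer, lag) pairs for consumers that are too far behind."""
    alerts = []
    for consumer, offset in consumer_offsets.items():
        lag = log_head_offset - offset
        if lag > ALERT_THRESHOLD:
            alerts.append((consumer, lag))
    return alerts

# "analytics" is 600,000 messages behind, so it triggers an alert while
# there is still time to fix it before its offset falls off the log.
```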

即使某个消费者确实落后太多并开始丢失消息,也只有该消费者受到影响;它不会干扰其他消费者的服务。这一事实是一个巨大的运营优势:您可以实验性地使用生产日志用于开发、测试或调试目的,而不必太担心中断生产服务。当消费者关闭或崩溃时,它会停止消耗资源——唯一剩下的就是它的消费者偏移量。

Even if a consumer does fall too far behind and starts missing messages, only that consumer is affected; it does not disrupt the service for other consumers. This fact is a big operational advantage: you can experimentally consume a production log for development, testing, or debugging purposes, without having to worry much about disrupting production services. When a consumer is shut down or crashes, it stops consuming resources—the only thing that remains is its consumer offset.

这种行为也与传统的消息代理形成鲜明对比,在传统的消息代理中,您需要小心删除消费者已关闭的任何队列,否则它们会继续不必要地累积消息并从仍处于活动状态的消费者中夺走内存。

This behavior also contrasts with traditional message brokers, where you need to be careful to delete any queues whose consumers have been shut down—otherwise they continue unnecessarily accumulating messages and taking away memory from consumers that are still active.

重播旧消息

Replaying old messages

我们之前注意到,对于 AMQP 和 JMS 风格的消息代理,处理和确认消息是一种破坏性操作,因为它会导致消息在代理上被删除。另一方面,在基于日志的消息代理中,消费消息更像是从文件中读取:它是一种只读操作,不会更改日志。

We noted previously that with AMQP- and JMS-style message brokers, processing and acknowledging messages is a destructive operation, since it causes the messages to be deleted on the broker. On the other hand, in a log-based message broker, consuming messages is more like reading from a file: it is a read-only operation that does not change the log.

除了消费者的任何输出之外,处理的唯一副作用是消费者偏移量向前移动。但偏移量在消费者的控制之下,因此如果需要可以很容易地对其进行操作:例如,您可以使用昨天的偏移量启动一个消费者副本,并将输出写入不同的位置,以便重新处理最近一天的消息。您可以重复此操作任意多次,并改变处理代码。

The only side effect of processing, besides any output of the consumer, is that the consumer offset moves forward. But the offset is under the consumer’s control, so it can easily be manipulated if necessary: for example, you can start a copy of a consumer with yesterday’s offsets and write the output to a different location, in order to reprocess the last day’s worth of messages. You can repeat this any number of times, varying the processing code.
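Because consumption is a read-only scan, replay amounts to running the scan again from an earlier offset with different processing code. In a real system such as Kafka you would seek the consumer to an older offset; in this sketch a plain list stands in for the log:

```python
# Replaying old messages by resetting the consumer offset: the log is
# untouched, so a second pass from an earlier offset can reprocess the
# same messages with entirely different processing code.

log = ["order:1", "order:2", "order:3"]

def process_from(offset, handler):
    """Read-only scan of the log from a given offset, applying a handler."""
    return [handler(msg) for msg in log[offset:]]

first_pass = process_from(0, str.upper)           # original processing
second_pass = process_from(0, lambda m: m[::-1])  # replay with new code
```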

这方面使得基于日志的消息传递更像上一章的批处理,其中派生数据通过可重复的转换过程与输入数据清楚地分开。它允许进行更多实验,并且更容易从错误和缺陷中恢复,使其成为在组织内集成数据流的良好工具 [24]。

This aspect makes log-based messaging more like the batch processes of the last chapter, where derived data is clearly separated from input data through a repeatable transformation process. It allows more experimentation and easier recovery from errors and bugs, making it a good tool for integrating dataflows within an organization [24].

数据库和流

Databases and Streams

我们对消息代理和数据库进行了一些比较。尽管传统上它们被认为是不同类别的工具,但我们看到基于日志的消息代理已经成功地从数据库中获取想法并将其应用到消息传递中。我们也可以反过来:从消息传递和流中获取想法,并将其应用到数据库中。

We have drawn some comparisons between message brokers and databases. Even though they have traditionally been considered separate categories of tools, we saw that log-based message brokers have been successful in taking ideas from databases and applying them to messaging. We can also go in reverse: take ideas from messaging and streams, and apply them to databases.

我们之前说过,事件是在某个时间点发生的事情的记录。发生的事情可能是用户操作(例如,键入搜索查询)或传感器读数,但也可能是对数据库的写入。某些内容被写入数据库这一事实,本身就是一个可以捕获、存储和处理的事件。这一观察表明,数据库与流之间的联系远不止于日志在磁盘上的物理存储,而是非常根本的。

We said previously that an event is a record of something that happened at some point in time. The thing that happened may be a user action (e.g., typing a search query), or a sensor reading, but it may also be a write to a database. The fact that something was written to a database is an event that can be captured, stored, and processed. This observation suggests that the connection between databases and streams runs deeper than just the physical storage of logs on disk—it is quite fundamental.

事实上,复制日志(请参阅“复制日志的实现”)是数据库写入事件流,由领导者在处理事务时生成。追随者将该写入流应用到他们自己的数据库副本,从而最终获得相同数据的准确副本。复制日志中的事件描述发生的数据更改。

In fact, a replication log (see “Implementation of Replication Logs”) is a stream of database write events, produced by the leader as it processes transactions. The followers apply that stream of writes to their own copy of the database and thus end up with an accurate copy of the same data. The events in the replication log describe the data changes that occurred.

我们还在“全序广播”中遇到了状态机复制原理,其中指出:如果每个事件都代表对数据库的写入,并且每个副本以相同的顺序处理相同的事件,那么所有副本最终都会达到相同的最终状态。(处理事件被假定为确定性操作。)这只是事件流的另一种情况!

We also came across the state machine replication principle in “Total Order Broadcast”, which states: if every event represents a write to the database, and every replica processes the same events in the same order, then the replicas will all end up in the same final state. (Processing an event is assumed to be a deterministic operation.) It’s just another case of event streams!
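The state machine replication principle can be demonstrated in miniature. Here two replicas apply the same deterministic write events in the same order and necessarily converge on the same state:

```python
# State machine replication in miniature: applying the same deterministic
# events in the same order on every replica yields identical final states.

events = [("set", "x", "A"), ("set", "y", "B"), ("del", "x", None)]

def apply_event(state, event):
    """Deterministically apply one write event to a replica's state."""
    op, key, value = event
    if op == "set":
        state[key] = value
    elif op == "del":
        state.pop(key, None)
    return state

replica1, replica2 = {}, {}
for e in events:
    apply_event(replica1, e)
for e in events:
    apply_event(replica2, e)

assert replica1 == replica2 == {"y": "B"}  # both replicas converge
```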

在本节中,我们将首先研究异构数据系统中出现的问题,然后探讨如何通过将事件流中的想法引入数据库来解决该问题。

In this section we will first look at a problem that arises in heterogeneous data systems, and then explore how we can solve it by bringing ideas from event streams to databases.

保持系统同步

Keeping Systems in Sync

正如我们在本书中所看到的,没有一个系统可以满足所有数据存储、查询和处理需求。在实践中,大多数重要的应用程序需要结合多种不同的技术来满足其需求:例如,使用 OLTP 数据库来服务用户请求,使用缓存来加速常见请求,使用全文索引来处理搜索查询,以及用于分析的数据仓库。其中每一个都有自己的数据副本,以针对其自身目的进行优化的自己的表示形式存储。

As we have seen throughout this book, there is no single system that can satisfy all data storage, querying, and processing needs. In practice, most nontrivial applications need to combine several different technologies in order to satisfy their requirements: for example, using an OLTP database to serve user requests, a cache to speed up common requests, a full-text index to handle search queries, and a data warehouse for analytics. Each of these has its own copy of the data, stored in its own representation that is optimized for its own purposes.

由于相同或相关的数据出现在多个不同的地方,因此它们需要保持彼此同步:如果数据库中更新了一项,那么它也需要在缓存、搜索索引和数据仓库中更新。对于数据仓库,这种同步通常由 ETL 流程(请参阅“数据仓库” )执行,通常通过获取数据库的完整副本、对其进行转换并将其批量加载到数据仓库中(换句话说,即批处理)。同样,我们在 “批处理工作流程的输出”中看到了如何使用批处理流程创建搜索索引、推荐系统和其他派生数据系统。

As the same or related data appears in several different places, they need to be kept in sync with one another: if an item is updated in the database, it also needs to be updated in the cache, search indexes, and data warehouse. With data warehouses this synchronization is usually performed by ETL processes (see “Data Warehousing”), often by taking a full copy of a database, transforming it, and bulk-loading it into the data warehouse—in other words, a batch process. Similarly, we saw in “The Output of Batch Workflows” how search indexes, recommendation systems, and other derived data systems might be created using batch processes.

如果定期完整数据库转储太慢,有时使用的替代方案是双重写入,其中应用程序代码在数据更改时显式写入每个系统:例如,首先写入数据库,然后更新搜索索引,然后使缓存条目无效(甚至同时执行这些写入)。

If periodic full database dumps are too slow, an alternative that is sometimes used is dual writes, in which the application code explicitly writes to each of the systems when data changes: for example, first writing to the database, then updating the search index, then invalidating the cache entries (or even performing those writes concurrently).

然而,双重写入有一些严重的问题,其中之一就是图 11-4 所示的竞争条件。在这个例子中,两个客户端同时想要更新一个项目 X:客户端 1 想要将值设置为 A,客户端 2 想要将其设置为 B。两个客户端首先将新值写入数据库,然后将其写入搜索索引。由于时间不凑巧,请求是交错的:数据库首先看到客户端 1 的写入将值设置为 A,然后看到客户端 2 的写入将值设置为 B,因此数据库中的最终值为 B。搜索索引首先看到客户端 2 的写入,然后是客户端 1 的写入,因此搜索索引中的最终值为 A。即使没有发生错误,两个系统现在也彼此永久不一致。

However, dual writes have some serious problems, one of which is a race condition illustrated in Figure 11-4. In this example, two clients concurrently want to update an item X: client 1 wants to set the value to A, and client 2 wants to set it to B. Both clients first write the new value to the database, then write it to the search index. Due to unlucky timing, the requests are interleaved: the database first sees the write from client 1 setting the value to A, then the write from client 2 setting the value to B, so the final value in the database is B. The search index first sees the write from client 2, then client 1, so the final value in the search index is A. The two systems are now permanently inconsistent with each other, even though no error occurred.

图 11-4。在数据库中,X 首先设置为 A,然后设置为 B,而在搜索索引处,写入以相反的顺序到达。

Figure 11-4. In the database, X is first set to A and then to B, while at the search index the writes arrive in the opposite order.
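The interleaving in Figure 11-4 is easy to reproduce directly. This sketch plays out the two clients' writes in the unlucky order described in the text:

```python
# Reproducing the dual-write race of Figure 11-4: the database and the
# search index see the two clients' writes in different orders, and end
# up permanently inconsistent even though no error was ever raised.

database, search_index = {}, {}

# Client 1 sets X=A, client 2 sets X=B.
database["X"] = "A"      # client 1's write reaches the database first
database["X"] = "B"      # client 2's write arrives second: database has B
search_index["X"] = "B"  # client 2's write reaches the index first
search_index["X"] = "A"  # client 1's write arrives second: index has A

assert database["X"] != search_index["X"]  # silently inconsistent
```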

除非您有一些额外的并发检测机制,例如我们在“检测并发写入”中讨论的版本向量,否则您甚至不会注意到发生了并发写入——一个值只会默默地覆盖另一个值。

Unless you have some additional concurrency detection mechanism, such as the version vectors we discussed in “Detecting Concurrent Writes”, you will not even notice that concurrent writes occurred—one value will simply silently overwrite another value.

双写入的另一个问题是其中一个写入可能会失败,而另一个会成功。这是一个容错问题而不是并发问题,但它也会导致两个系统变得不一致。确保它们要么都成功要么都失败是原子提交问题的一个例子,解决这个问题的成本很高(请参阅 “原子提交和两阶段提交(2PC)”)。

Another problem with dual writes is that one of the writes may fail while the other succeeds. This is a fault-tolerance problem rather than a concurrency problem, but it also has the effect of the two systems becoming inconsistent with each other. Ensuring that they either both succeed or both fail is a case of the atomic commit problem, which is expensive to solve (see “Atomic Commit and Two-Phase Commit (2PC)”).

如果您只有一个具有单个领导者的复制数据库,则该领导者决定写入顺序,因此状态机复制方法可以在数据库的副本之间起作用。然而,在图 11-4 中,并没有单一的领导者:数据库可能有一个领导者,搜索索引也可能有一个领导者,但两者都不跟随另一个,因此可能会发生冲突(参见“多领导者复制”)。

If you only have one replicated database with a single leader, then that leader determines the order of writes, so the state machine replication approach works among replicas of the database. However, in Figure 11-4 there isn’t a single leader: the database may have a leader and the search index may have a leader, but neither follows the other, and so conflicts can occur (see “Multi-Leader Replication”).

如果真的只有一个领导者(例如数据库)并且我们可以使搜索索引成为数据库的跟随者,情况会更好。但这在实践中可能吗?

The situation would be better if there really was only one leader—for example, the database—and if we could make the search index a follower of the database. But is this possible in practice?

变更数据捕获

Change Data Capture

大多数数据库的复制日志的问题在于,它们长期以来被认为是数据库的内部实现细节,而不是公共 API。客户端应该通过数据库的数据模型和查询语言来查询数据库,而不是解析复制日志并尝试从中提取数据。

The problem with most databases’ replication logs is that they have long been considered to be an internal implementation detail of the database, not a public API. Clients are supposed to query the database through its data model and query language, not parse the replication logs and try to extract data from them.

几十年来,许多数据库根本没有记录的方法来将更改日志写入其中。因此,很难将数据库中所做的所有更改复制到不同的存储技术(例如搜索索引、缓存或数据仓库)。

For decades, many databases simply did not have a documented way of getting the log of changes written to them. For this reason it was difficult to take all the changes made in a database and replicate them to a different storage technology such as a search index, cache, or data warehouse.

最近,人们对变更数据捕获(CDC) 越来越感兴趣,这是观察写入数据库的所有数据变更并以可复制到其他系统的形式提取它们的过程。如果更改在写入后立即以流的形式提供,那么 CDC 会特别有趣。

More recently, there has been growing interest in change data capture (CDC), which is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to other systems. CDC is especially interesting if changes are made available as a stream, immediately as they are written.

例如,您可以捕获数据库中的更改并不断将相同的更改应用于搜索索引。如果以相同的顺序应用更改日志,则可以预期搜索索引中的数据与数据库中的数据相匹配。搜索索引和任何其他派生数据系统只是变更流的消费者,如图11-5所示。

For example, you can capture the changes in a database and continually apply the same changes to a search index. If the log of changes is applied in the same order, you can expect the data in the search index to match the data in the database. The search index and any other derived data systems are just consumers of the change stream, as illustrated in Figure 11-5.

图 11-5。按照将数据写入一个数据库的顺序获取数据,并以相同的顺序将更改应用到其他系统。

Figure 11-5. Taking data in the order it was written to one database, and applying the changes to other systems in the same order.

实施变更数据捕获

Implementing change data capture

我们可以将日志消费者称为派生数据系统,正如第三部分的介绍中所讨论的 :存储在搜索索引和数据仓库中的数据只是记录系统中数据的另一种视图。更改数据捕获是一种机制,用于确保对记录系统所做的所有更改也反映在派生数据系统中,以便派生系统具有准确的数据副本。

We can call the log consumers derived data systems, as discussed in the introduction to Part III: the data stored in the search index and the data warehouse is just another view onto the data in the system of record. Change data capture is a mechanism for ensuring that all changes made to the system of record are also reflected in the derived data systems so that the derived systems have an accurate copy of the data.

从本质上讲,变更数据捕获使一个数据库成为领导者(捕获变更的数据库),并将其他数据库转变为追随者。基于日志的消息代理非常适合传输来自源数据库的更改事件,因为它保留了消息的顺序(避免了图 11-2中的重新排序问题)。

Essentially, change data capture makes one database the leader (the one from which the changes are captured), and turns the others into followers. A log-based message broker is well suited for transporting the change events from the source database, since it preserves the ordering of messages (avoiding the reordering issue of Figure 11-2).
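The leader/follower arrangement can be sketched as follows. All names here are illustrative; the ordered list stands in for the change log a real CDC pipeline would transport through a log-based message broker:

```python
# Change data capture sketch: writes go only to the leader database; each
# write appends a change event to an ordered log, and a follower (here a
# toy "search index") applies the events in log order, so it converges on
# the same contents as the leader.

change_log = []   # ordered stream of change events
database = {}     # the leader / system of record
search_index = {} # a derived follower

def write(key, value):
    """All writes go through the leader, which fixes the event order."""
    database[key] = value
    change_log.append(("put", key, value))

def sync_follower():
    """Apply the change events, in order, to the derived system."""
    for op, key, value in change_log:
        if op == "put":
            search_index[key] = value

write("X", "A")
write("X", "B")   # the leader serializes the two writes
sync_follower()
assert search_index == database == {"X": "B"}  # no Figure 11-4 race
```

Because the leader serializes the writes before they reach any follower, the race condition of Figure 11-4 cannot occur.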

数据库触发器可用于通过注册观察数据表所有更改并将相应条目添加到更改日志表的触发器来实现更改数据捕获(请参阅“基于触发器的复制” )。然而,它们往往很脆弱并且具有显着的性能开销。解析复制日志可能是一种更稳健的方法,尽管它也面临着挑战,例如处理架构更改。

Database triggers can be used to implement change data capture (see “Trigger-based replication”) by registering triggers that observe all changes to data tables and add corresponding entries to a changelog table. However, they tend to be fragile and have significant performance overheads. Parsing the replication log can be a more robust approach, although it also comes with challenges, such as handling schema changes.

LinkedIn 的 Databus [ 25 ]、Facebook 的 Wormhole [ 26 ] 和 Yahoo! 的 Sherpa [ 27 ] 大规模使用了这个想法。Bottled Water 使用解码预写日志 [ 28 ]的 API 为 PostgreSQL 实现 CDC ,Maxwell 和 Debezium 通过解析 binlog [ 29 , 30 , 31 ] 为 MySQL 做类似的事情,Mongoriver 读取 MongoDB oplog [ 32 , 33 ] ,并且 GoldenGate 为 Oracle 提供了类似的设施 [ 34 , 35 ]。

LinkedIn’s Databus [25], Facebook’s Wormhole [26], and Yahoo!’s Sherpa [27] use this idea at large scale. Bottled Water implements CDC for PostgreSQL using an API that decodes the write-ahead log [28], Maxwell and Debezium do something similar for MySQL by parsing the binlog [29, 30, 31], Mongoriver reads the MongoDB oplog [32, 33], and GoldenGate provides similar facilities for Oracle [34, 35].

与消息代理一样,更改数据捕获通常是异步的:记录数据库系统在提交更改之前不会等待更改应用到使用者。这种设计的操作优势是添加慢速消费者不会对记录系统产生太大影响,但它的缺点是所有复制滞后问题都适用(请参阅“复制滞后问题”)。

Like message brokers, change data capture is usually asynchronous: the system of record database does not wait for the change to be applied to consumers before committing it. This design has the operational advantage that adding a slow consumer does not affect the system of record too much, but it has the downside that all the issues of replication lag apply (see “Problems with Replication Lag”).

初始快照

Initial snapshot

如果您拥有对数据库所做的所有更改的日志,则可以通过重播该日志来重建数据库的整个状态。然而,在许多情况下,永久保留所有更改将需要太多磁盘空间,并且重播将花费太长时间,因此需要截断日志。

If you have the log of all changes that were ever made to a database, you can reconstruct the entire state of the database by replaying the log. However, in many cases, keeping all changes forever would require too much disk space, and replaying it would take too long, so the log needs to be truncated.

例如,构建新的全文索引需要整个数据库的完整副本,仅应用最近更改的日志是不够的,因为它会丢失最近未更新的项目。因此,如果您没有完整的日志历史记录,则需要从一致的快照开始,如前面“设置新关注者”中所述。

Building a new full-text index, for example, requires a full copy of the entire database—it is not sufficient to only apply a log of recent changes, since it would be missing items that were not recently updated. Thus, if you don’t have the entire log history, you need to start with a consistent snapshot, as previously discussed in “Setting Up New Followers”.

数据库的快照必须对应于更改日志中的已知位置或偏移量,以便您知道在处理快照后从哪个点开始应用更改。一些 CDC 工具集成了此快照功能,而其他工具则将其保留为手动操作。

The snapshot of the database must correspond to a known position or offset in the change log, so that you know at which point to start applying changes after the snapshot has been processed. Some CDC tools integrate this snapshot facility, while others leave it as a manual operation.

日志压缩

Log compaction

如果只能保留有限数量的日志历史记录,则每次要添加新的派生数据系统时都需要执行快照过程。然而,日志压缩提供了一个很好的选择。

If you can only keep a limited amount of log history, you need to go through the snapshot process every time you want to add a new derived data system. However, log compaction provides a good alternative.

我们之前在“哈希索引”中讨论过日志压缩,在日志结构存储引擎的背景下(参见图 3-2的示例)。原理很简单:存储引擎定期查找具有相同键的日志记录,丢弃任何重复项,并仅保留每个键的最新更新。此压缩和合并过程在后台运行。

We discussed log compaction previously in “Hash Indexes”, in the context of log-structured storage engines (see Figure 3-2 for an example). The principle is simple: the storage engine periodically looks for log records with the same key, throws away any duplicates, and keeps only the most recent update for each key. This compaction and merging process runs in the background.

在日志结构存储引擎中,具有特殊空值(逻辑删除,即墓碑)的更新表示某个键已被删除,并导致它在日志压缩期间被移除。但只要某个键未被覆盖或删除,它就会永远保留在日志中。这种压缩日志所需的磁盘空间仅取决于数据库的当前内容,而不取决于数据库中曾经发生的写入次数。如果同一个键频繁被覆盖,之前的值最终会被垃圾回收,只保留最新的值。

In a log-structured storage engine, an update with a special null value (a tombstone) indicates that a key was deleted, and causes it to be removed during log compaction. But as long as a key is not overwritten or deleted, it stays in the log forever. The disk space required for such a compacted log depends only on the current contents of the database, not the number of writes that have ever occurred in the database. If the same key is frequently overwritten, previous values will eventually be garbage-collected, and only the latest value will be retained.
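The compaction rule is simple enough to express in a few lines. This is a minimal sketch of the principle only; real engines compact segment by segment in the background:

```python
# Log compaction sketch: keep only the latest record for each key, and
# drop keys whose latest record is a tombstone (None). The compacted
# size depends on the current contents, not on how many writes ever
# occurred.

def compact(log):
    latest = {}                 # last write wins, scanning in log order
    for key, value in log:
        latest[key] = value
    # rewrite as a log, omitting keys whose latest record is a tombstone
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("a", 4)]
print(compact(log))  # [('a', 4)] -- 'b' was deleted, 'a' keeps its latest value
```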

同样的想法适用于基于日志的消息代理和变更数据捕获的上下文。如果 CDC 系统设置为每次更改都有一个主键,并且每次更新键都会替换该键的先前值,那么仅保留特定键的最新写入就足够了。

The same idea works in the context of log-based message brokers and change data capture. If the CDC system is set up such that every change has a primary key, and every update for a key replaces the previous value for that key, then it’s sufficient to keep just the most recent write for a particular key.

现在,每当您想要重建派生数据系统(例如搜索索引)时,您都可以从日志压缩主题的偏移量 0 开始一个新的消费者,并顺序扫描日志中的所有消息。日志保证包含数据库中每个键的最新值(可能还有一些较旧的值)。换句话说,您可以使用它来获取数据库内容的完整副本,而无需再对 CDC 的源数据库进行一次快照。

Now, whenever you want to rebuild a derived data system such as a search index, you can start a new consumer from offset 0 of the log-compacted topic, and sequentially scan over all messages in the log. The log is guaranteed to contain the most recent value for every key in the database (and maybe some older values)—in other words, you can use it to obtain a full copy of the database contents without having to take another snapshot of the CDC source database.

Apache Kafka 支持此日志压缩功能。正如我们将在本章后面看到的,它允许消息代理用于持久存储,而不仅仅是用于瞬时消息传递。

This log compaction feature is supported by Apache Kafka. As we shall see later in this chapter, it allows the message broker to be used for durable storage, not just for transient messaging.

对变更流的 API 支持

API support for change streams

数据库越来越多地开始支持变更流作为一流的接口,而不是典型的改造和逆向工程 CDC 工作。例如,RethinkDB 允许查询在查询结果更改时订阅通知 [ 36 ],Firebase [ 37 ] 和 CouchDB [ 38 ] 基于也可供应用程序使用的更改源提供数据同步,而 Meteor 使用MongoDB oplog 订阅数据更改并更新用户界面[ 39 ]。

Increasingly, databases are beginning to support change streams as a first-class interface, rather than the typical retrofitted and reverse-engineered CDC efforts. For example, RethinkDB allows queries to subscribe to notifications when the results of a query change [36], Firebase [37] and CouchDB [38] provide data synchronization based on a change feed that is also made available to applications, and Meteor uses the MongoDB oplog to subscribe to data changes and update the user interface [39].

VoltDB 允许事务以流的形式连续从数据库导出数据[ 40 ]。数据库将关系数据模型中的输出流表示为一个表,事务可以在其中插入元组,但不能查询该表。然后,该流包含提交事务已按照提交顺序写入此特殊表的元组日志。外部使用者可以异步使用此日志并使用它来更新派生数据系统。

VoltDB allows transactions to continuously export data from a database in the form of a stream [40]. The database represents an output stream in the relational data model as a table into which transactions can insert tuples, but which cannot be queried. The stream then consists of the log of tuples that committed transactions have written to this special table, in the order they were committed. External consumers can asynchronously consume this log and use it to update derived data systems.

Kafka Connect [ 41 ] 致力于将各种数据库系统的变更数据捕获工具与 Kafka 集成。一旦更改事件流位于 Kafka 中,它就可以用于更新派生数据系统(例如搜索索引),也可以输入到流处理系统中,如本章后面讨论的那样。

Kafka Connect [41] is an effort to integrate change data capture tools for a wide range of database systems with Kafka. Once the stream of change events is in Kafka, it can be used to update derived data systems such as search indexes, and also feed into stream processing systems as discussed later in this chapter.

事件溯源

Event Sourcing

我们在这里讨论的想法和事件溯源 之间有一些相似之处,事件溯源是领域驱动设计 (DDD) 社区中开发的一种技术[ 42,43,44 ]。我们将简要讨论事件溯源,因为它包含了一些针对流系统的有用且相关的想法。

There are some parallels between the ideas we’ve discussed here and event sourcing, a technique that was developed in the domain-driven design (DDD) community [42, 43, 44]. We will discuss event sourcing briefly, because it incorporates some useful and relevant ideas for streaming systems.

与变更数据捕获类似,事件溯源涉及将应用程序状态的所有变更存储为变更事件日志。最大的区别是事件溯源在不同的抽象级别应用了这个想法:

Similarly to change data capture, event sourcing involves storing all changes to the application state as a log of change events. The biggest difference is that event sourcing applies the idea at a different level of abstraction:

  • 在变更数据捕获中,应用程序以可变的方式使用数据库,随意更新和删除记录。更改日志是在较低级别从数据库中提取的(例如,通过解析复制日志),这确保了从数据库中提取的写入顺序与实际写入的顺序相匹配,避免了图中的竞争 条件11-4 . 写入数据库的应用程序不需要知道 CDC 正在发生。

  • In change data capture, the application uses the database in a mutable way, updating and deleting records at will. The log of changes is extracted from the database at a low level (e.g., by parsing the replication log), which ensures that the order of writes extracted from the database matches the order in which they were actually written, avoiding the race condition in Figure 11-4. The application writing to the database does not need to be aware that CDC is occurring.

  • 在事件溯源中,应用程序逻辑显式构建在写入事件日志的不可变事件的基础上。在这种情况下,事件存储是仅追加的,并且不鼓励或禁止更新或删除。事件旨在反映应用程序级别发生的事情,而不是低级别的状态更改。

  • In event sourcing, the application logic is explicitly built on the basis of immutable events that are written to an event log. In this case, the event store is append-only, and updates or deletes are discouraged or prohibited. Events are designed to reflect things that happened at the application level, rather than low-level state changes.

事件溯源是一种强大的数据建模技术:从应用程序的角度来看,将用户的操作记录为不可变事件比在可变数据库上记录这些操作的效果更有意义。事件溯源使得随着时间的推移更容易发展应用程序,通过使事后更容易理解发生某些事情的原因来帮助调试,并防止应用程序错误(请参阅“不可变事件的优点”)。

Event sourcing is a powerful technique for data modeling: from an application point of view it is more meaningful to record the user’s actions as immutable events, rather than recording the effect of those actions on a mutable database. Event sourcing makes it easier to evolve applications over time, helps with debugging by making it easier to understand after the fact why something happened, and guards against application bugs (see “Advantages of immutable events”).

例如,存储事件“学生取消了课程注册”以中立的方式清楚地表达了单个操作的意图,而副作用“从注册表中删除了一个条目,并在学生反馈表中添加了一个取消原因”则嵌入了许多关于数据稍后使用方式的假设。如果引入新的应用程序功能(例如,“将名额提供给等待名单上的下一个人”),事件溯源方法允许将新的副作用轻松地与现有事件链接起来。

For example, storing the event “student cancelled their course enrollment” clearly expresses the intent of a single action in a neutral fashion, whereas the side effects “one entry was deleted from the enrollments table, and one cancellation reason was added to the student feedback table” embed a lot of assumptions about the way the data is later going to be used. If a new application feature is introduced—for example, “the place is offered to the next person on the waiting list”—the event sourcing approach allows that new side effect to easily be chained off the existing event.

事件溯源类似于编年史数据模型[ 45 ],并且事件日志和星型模式中的事实表之间也有相似之处(请参阅 “星星和雪花:分析模式”)。

Event sourcing is similar to the chronicle data model [45], and there are also similarities between an event log and the fact table that you find in a star schema (see “Stars and Snowflakes: Schemas for Analytics”).

诸如事件存储[ 46 ]之类的专用数据库已经被开发出来,以支持使用事件源的应用程序,但一般来说,该方法独立于任何特定工具。传统的数据库或基于日志的消息代理也可以用于构建这种风格的应用程序。

Specialized databases such as Event Store [46] have been developed to support applications using event sourcing, but in general the approach is independent of any particular tool. A conventional database or a log-based message broker can also be used to build applications in this style.

从事件日志中获取当前状态

Deriving current state from the event log

事件日志本身并不是很有用,因为用户通常希望看到系统的当前状态,而不是修改的历史记录。例如,在购物网站上,用户希望能够看到购物车的当前内容,而不是他们对购物车所做的所有更改的仅附加列表。

An event log by itself is not very useful, because users generally expect to see the current state of a system, not the history of modifications. For example, on a shopping website, users expect to be able to see the current contents of their cart, not an append-only list of all the changes they have ever made to their cart.

因此,使用事件溯源的应用程序需要获取事件日志(代表写入系统的数据)并将其转换为适合向用户显示的应用程序状态(即从系统读取数据的方式 [47])。此转换可以使用任意逻辑,但它应该是确定性的,以便您可以再次运行它并从事件日志中派生出相同的应用程序状态。

Thus, applications that use event sourcing need to take the log of events (representing the data written to the system) and transform it into application state that is suitable for showing to a user (the way in which data is read from the system [47]). This transformation can use arbitrary logic, but it should be deterministic so that you can run it again and derive the same application state from the event log.
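Such a transformation is just a deterministic fold over the event log. The shopping-cart events below are illustrative, picking up the example from the previous section:

```python
# Deriving current state from an event log: a deterministic fold over the
# events produces the state shown to users, and replaying the same log
# always reproduces the same state.

events = [
    {"type": "added", "item": "book"},
    {"type": "added", "item": "pen"},
    {"type": "removed", "item": "book"},
]

def cart_state(events):
    """Fold the immutable event log into the cart's current contents."""
    cart = []
    for e in events:
        if e["type"] == "added":
            cart.append(e["item"])
        elif e["type"] == "removed":
            cart.remove(e["item"])
    return cart

assert cart_state(events) == ["pen"]             # current state for display
assert cart_state(events) == cart_state(events)  # deterministic replay
```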

与更改数据捕获一样,重放事件日志允许您重建系统的当前状态。然而,日志压缩需要以不同的方式处理:

Like with change data capture, replaying the event log allows you to reconstruct the current state of the system. However, log compaction needs to be handled differently:

  • 用于更新记录的 CDC 事件通常包含该记录的整个新版本,因此主键的当前值完全由该主键的最新事件决定,日志压缩可以丢弃同一主键的先前事件。

  • A CDC event for the update of a record typically contains the entire new version of the record, so the current value for a primary key is entirely determined by the most recent event for that primary key, and log compaction can discard previous events for the same key.

  • 另一方面,通过事件源,事件在更高级别上建模:事件通常表达用户操作的意图,而不是由于操作而发生的状态更新的机制。在这种情况下,后来的事件通常不会覆盖先前的事件,因此您需要事件的完整历史记录来重建最终状态。日志压缩不可能以同样的方式进行。

  • On the other hand, with event sourcing, events are modeled at a higher level: an event typically expresses the intent of a user action, not the mechanics of the state update that occurred as a result of the action. In this case, later events typically do not override prior events, and so you need the full history of events to reconstruct the final state. Log compaction is not possible in the same way.

使用事件源的应用程序通常具有某种机制来存储从事件日志中派生的当前状态的快照,因此它们不需要重复地重新处理完整日志。然而,这只是一种性能优化,旨在加快读取速度和崩溃恢复速度;目的是系统能够永久存储所有原始事件,并在需要时重新处理完整的事件日志。我们在 “不变性的局限性”中讨论了这个假设。

Applications that use event sourcing typically have some mechanism for storing snapshots of the current state that is derived from the log of events, so they don’t need to repeatedly reprocess the full log. However, this is only a performance optimization to speed up reads and recovery from crashes; the intention is that the system is able to store all raw events forever and reprocess the full event log whenever required. We discuss this assumption in “Limitations of immutability”.

命令和事件

Commands and events

事件溯源哲学仔细区分事件和命令 [48]。当来自用户的请求首次到达时,它最初是一个命令:此时它仍然可能失败,例如因为违反了某些完整性条件。应用程序必须首先验证它是否可以执行该命令。如果验证成功并且命令被接受,它就成为一个事件,事件是持久且不可变的。

The event sourcing philosophy is careful to distinguish between events and commands [48]. When a request from a user first arrives, it is initially a command: at this point it may still fail, for example because some integrity condition is violated. The application must first validate that it can execute the command. If the validation is successful and the command is accepted, it becomes an event, which is durable and immutable.
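The distinction can be sketched as follows, using the username-registration rule discussed below as the validation check. The function and field names are illustrative:

```python
# Command vs. event sketch: a command is validated synchronously and may
# be rejected; only accepted commands become immutable events in the log,
# and consumers of the log may not reject them.

event_log = []            # append-only; entries are immutable facts
taken_usernames = set()

def handle_command(user_id, username):
    """Validate the command; on success, append a durable event."""
    # validation happens *before* anything reaches the event log
    if username in taken_usernames:
        return False      # command rejected: no event is ever written
    taken_usernames.add(username)
    event_log.append({"event": "username_registered",
                      "user": user_id, "name": username})
    return True

assert handle_command(1, "martin") is True
assert handle_command(2, "martin") is False  # rejected command, no event
assert len(event_log) == 1
```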

例如,如果用户尝试注册特定的用户名,或者预订飞机或剧院的座位,则应用程序需要检查该用户名或座位是否尚未被占用。(我们之前在“容错共识”中讨论过这个示例。)当检查成功时,应用程序可以生成一个事件,表明某个特定用户名已由某个特定用户 ID 注册,或者某个特定座位已为某个特定客户保留。

For example, if a user tries to register a particular username, or reserve a seat on an airplane or in a theater, then the application needs to check that the username or seat is not already taken. (We previously discussed this example in “Fault-Tolerant Consensus”.) When that check has succeeded, the application can generate an event to indicate that a particular username was registered by a particular user ID, or that a particular seat has been reserved for a particular customer.

当事件产生时,它就成为事实。即使客户后来决定更改或取消预订,他们之前保留特定座位的事实仍然成立,并且更改或取消是稍后添加的单独事件。

At the point when the event is generated, it becomes a fact. Even if the customer later decides to change or cancel the reservation, the fact remains true that they formerly held a reservation for a particular seat, and the change or cancellation is a separate event that is added later.

事件流的使用者不允许拒绝事件:当使用者看到事件时,它已经是日志的不可变部分,并且可能已经被其他使用者看到。因此,命令的任何验证都需要在它成为事件之前同步发生,例如,通过使用可原子地验证命令并发布事件的可序列化事务。

A consumer of the event stream is not allowed to reject an event: by the time the consumer sees the event, it is already an immutable part of the log, and it may have already been seen by other consumers. Thus, any validation of a command needs to happen synchronously, before it becomes an event—for example, by using a serializable transaction that atomically validates the command and publishes the event.

或者,用户预订座位的请求可以分为两个事件:首先是临时预订,然后是预订验证后的单独确认事件(如“使用全序广播实现线性化存储”中所述)。这种拆分允许验证在异步过程中进行。

Alternatively, the user request to reserve a seat could be split into two events: first a tentative reservation, and then a separate confirmation event once the reservation has been validated (as discussed in “Implementing linearizable storage using total order broadcast”). This split allows the validation to take place in an asynchronous process.

状态、流和不变性

State, Streams, and Immutability

我们在第 10 章 中看到,批处理受益于其输入文件的不变性,因此您可以对现有输入文件运行实验性处理作业,而不必担心损坏它们。这种不变性原则也是事件源和变更数据捕获如此强大的原因。

We saw in Chapter 10 that batch processing benefits from the immutability of its input files, so you can run experimental processing jobs on existing input files without fear of damaging them. This principle of immutability is also what makes event sourcing and change data capture so powerful.

我们通常认为数据库存储应用程序的当前状态 - 这种表示形式针对读取进行了优化,并且通常最方便提供查询服务。状态的本质是它会变化,因此数据库支持更新、删除数据以及插入数据。这如何与不变性相适应?

We normally think of databases as storing the current state of the application—this representation is optimized for reads, and it is usually the most convenient for serving queries. The nature of state is that it changes, so databases support updating and deleting data as well as inserting it. How does this fit with immutability?

每当您拥有会变化的状态时,该状态都是随时间推移改变它的事件的结果。例如,当前可用座位的列表是您已处理的预订的结果,当前帐户余额是帐户上借贷记录的结果,而您的 Web 服务器的响应时间图则是所有已发生的 Web 请求各自响应时间的聚合。

Whenever you have state that changes, that state is the result of the events that mutated it over time. For example, your list of currently available seats is the result of the reservations you have processed, the current account balance is the result of the credits and debits on the account, and the response time graph for your web server is an aggregation of the individual response times of all web requests that have occurred.

无论状态如何变化,总会有一系列事件导致这些变化。即使事情已经完成和撤消,这些事件发生的事实仍然是真实的。关键思想是可变状态和不可变事件的仅附加日志并不相互矛盾:它们是同一枚硬币的两个方面。所有更改的日志(changelog)代表状态随时间的演变。

No matter how the state changes, there was always a sequence of events that caused those changes. Even as things are done and undone, the fact remains true that those events occurred. The key idea is that mutable state and an append-only log of immutable events do not contradict each other: they are two sides of the same coin. The log of all changes, the changelog, represents the evolution of state over time.

如果您擅长数学,您可能会说应用程序状态是随时间对事件流求积分的结果,而变更流则是对状态按时间求微分的结果,如图 11-6 所示 [49, 50, 51]。这个类比有局限性(例如,状态的二阶导数似乎没有意义),但它是思考数据的一个有用的起点。

If you are mathematically inclined, you might say that the application state is what you get when you integrate an event stream over time, and a change stream is what you get when you differentiate the state by time, as shown in Figure 11-6 [49, 50, 51]. The analogy has limitations (for example, the second derivative of state does not seem to be meaningful), but it’s a useful starting point for thinking about data.
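The integral/derivative analogy can be made concrete in a few lines (a sketch, assuming simple key–value events): folding events forward produces state, and diffing successive states recovers the change stream:

```python
from functools import reduce

def apply(state, event):
    # "Integration" step: apply one (key, value) event to the state.
    key, value = event
    new_state = dict(state)
    new_state[key] = value
    return new_state

def integrate(events):
    # Application state = fold of all events over time.
    return reduce(apply, events, {})

def differentiate(states):
    # Change stream = what differs between each successive state.
    changes = []
    prev = {}
    for state in states:
        for key, value in state.items():
            if prev.get(key) != value:
                changes.append((key, value))
        prev = state
    return changes
```

Differentiating the sequence of states produced by integration gives back the original event stream, which is the round trip the figure illustrates.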

图 11-6。当前应用程序状态和事件流之间的关系。

Figure 11-6. The relationship between the current application state and an event stream.

如果持久存储变更日志,则只会产生使状态可重现的效果。如果您将事件日志视为您的记录系统,并且将任何可变状态视为源自它,那么就可以更轻松地推理系统中的数据流。正如 Pat Helland 所说 [ 52 ]:

If you store the changelog durably, that simply has the effect of making the state reproducible. If you consider the log of events to be your system of record, and any mutable state as being derived from it, it becomes easier to reason about the flow of data through a system. As Pat Helland puts it [52]:

事务日志记录对数据库所做的所有更改。高速追加是更改日志的唯一方法。从这个角度来看,数据库的内容保存了日志中最新记录值的缓存。事实就是日志。数据库是日志子集的缓存。该缓存的子集恰好是日志中每个记录和索引值的最新值。

Transaction logs record all the changes made to the database. High-speed appends are the only way to change the log. From this perspective, the contents of the database hold a caching of the latest record values in the logs. The truth is the log. The database is a cache of a subset of the log. That cached subset happens to be the latest value of each record and index value from the log.

正如“日志压缩”中所讨论的,日志压缩是弥合日志和数据库状态之间区别的一种方法:它仅保留每条记录的最新版本,并丢弃覆盖的版本。

Log compaction, as discussed in “Log compaction”, is one way of bridging the distinction between log and database state: it retains only the latest version of each record, and discards overwritten versions.
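The core of log compaction fits in a few lines (a sketch of the idea only; real systems such as Kafka compact segment by segment in a background thread):

```python
def compact(log_entries):
    # Keep only the most recent value for each key; earlier, overwritten
    # versions are discarded.
    latest = {}
    for key, value in log_entries:
        latest[key] = value  # a later entry for the same key wins
    # One surviving entry per key (ordered here by each key's first appearance)
    return list(latest.items())
```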

不可变事件的优点

Advantages of immutable events

数据库的不变性是一个古老的想法。例如,几个世纪以来,会计师一直在财务簿记中使用不变性。当交易发生时,它被记录在一个仅附加的 分类账中,它本质上是描述金钱、商品或服务易手的事件日志。损益表或资产负债表等账户是通过将分类账中的交易相加而得出的[ 53 ]。

Immutability in databases is an old idea. For example, accountants have been using immutability for centuries in financial bookkeeping. When a transaction occurs, it is recorded in an append-only ledger, which is essentially a log of events describing money, goods, or services that have changed hands. The accounts, such as profit and loss or the balance sheet, are derived from the transactions in the ledger by adding them up [53].

如果出现错误,会计师不会删除或更改分类账中的错误交易,而是添加另一笔交易来弥补错误,例如退还错误的费用。不正确的交易仍然永远保留在分类账中,因为它对于审计原因可能很重要。如果已经公布了源自错误分类账的错误数据,则下一个会计期间的数据将进行更正。这个过程在会计中是完全正常的[ 54 ]。

If a mistake is made, accountants don’t erase or change the incorrect transaction in the ledger—instead, they add another transaction that compensates for the mistake, for example refunding an incorrect charge. The incorrect transaction still remains in the ledger forever, because it might be important for auditing reasons. If incorrect figures, derived from the incorrect ledger, have already been published, then the figures for the next accounting period include a correction. This process is entirely normal in accounting [54].

尽管这种可审计性在金融系统中特别重要,但对于许多其他不受如此严格监管的系统也有好处。正如“批处理输出的原理”中所讨论的 ,如果您不小心部署了有缺陷的代码,将错误的数据写入数据库,并且该代码能够破坏性地覆盖数据,那么恢复就会困难得多。使用不可变事件的仅附加日志,可以更轻松地诊断发生的情况并从问题中恢复。

Although such auditability is particularly important in financial systems, it is also beneficial for many other systems that are not subject to such strict regulation. As discussed in “Philosophy of batch process outputs”, if you accidentally deploy buggy code that writes bad data to a database, recovery is much harder if the code is able to destructively overwrite data. With an append-only log of immutable events, it is much easier to diagnose what happened and recover from the problem.

不可变事件还捕获了比当前状态更多的信息。例如,在购物网站上,客户可以将商品添加到购物车,然后再将其删除。尽管从订单履行的角度来看,第二个事件抵消了第一个事件,但出于分析目的,了解客户曾考虑过某个特定商品但随后决定不购买,可能会很有用。也许他们将来会选择购买它,或者也许他们找到了替代品。此信息会记录在事件日志中,但在一个当物品从购物车移除时就将其删除的数据库中,这些信息就会丢失 [42]。

Immutable events also capture more information than just the current state. For example, on a shopping website, a customer may add an item to their cart and then remove it again. Although the second event cancels out the first event from the point of view of order fulfillment, it may be useful to know for analytics purposes that the customer was considering a particular item but then decided against it. Perhaps they will choose to buy it in the future, or perhaps they found a substitute. This information is recorded in an event log, but would be lost in a database that deletes items when they are removed from the cart [42].
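To make the shopping-cart example concrete, here is a sketch (event names are illustrative) in which the same immutable event log feeds two different views: fulfillment sees only the final cart, while analytics still sees the removed item:

```python
def current_cart(events):
    # Order-fulfillment view: items still in the cart after all events.
    cart = set()
    for action, item in events:
        if action == "add":
            cart.add(item)
        elif action == "remove":
            cart.discard(item)
    return cart

def items_considered(events):
    # Analytics view: everything the customer ever added, even if removed.
    return {item for action, item in events if action == "add"}
```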

从同一事件日志导出多个视图

Deriving several views from the same event log

此外,通过将可变状态与不可变事件日志分离,您可以从同一事件日志中派生出几种不同的面向读取的表示形式。这就像一个流有多个消费者一样(图 11-5):例如,分析数据库 Druid 使用这种方法直接从 Kafka 摄取[ 55 ],Pistachio 是一个分布式键值存储,使用 Kafka 作为提交日志[ 56 ],Kafka Connect接收器可以将数据从Kafka导出到各种不同的数据库和索引[ 41 ]。对于许多其他存储和索引系统(例如搜索服务器)来说,类似地从分布式日志中获取输入是有意义的(请参阅“保持系统同步”)。

Moreover, by separating mutable state from the immutable event log, you can derive several different read-oriented representations from the same log of events. This works just like having multiple consumers of a stream (Figure 11-5): for example, the analytic database Druid ingests directly from Kafka using this approach [55], Pistachio is a distributed key-value store that uses Kafka as a commit log [56], and Kafka Connect sinks can export data from Kafka to various different databases and indexes [41]. It would make sense for many other storage and indexing systems, such as search servers, to similarly take their input from a distributed log (see “Keeping Systems in Sync”).

通过从事件日志到数据库的显式转换步骤,可以更轻松地随着时间的推移发展您的应用程序:如果您想引入以某种新方式呈现现有数据的新功能,您可以使用事件日志构建一个单独的新功能的读取优化视图,并与现有系统一起运行,而无需对其进行修改。并行运行新旧系统通常比在现有系统中执行复杂的模式迁移更容易。一旦不再需要旧系统,您可以简单地将其关闭并回收其资源 [ 47 , 57 ]。

Having an explicit translation step from an event log to a database makes it easier to evolve your application over time: if you want to introduce a new feature that presents your existing data in some new way, you can use the event log to build a separate read-optimized view for the new feature, and run it alongside the existing systems without having to modify them. Running old and new systems side by side is often easier than performing a complicated schema migration in an existing system. Once the old system is no longer needed, you can simply shut it down and reclaim its resources [47, 57].

如果您不必担心如何查询和访问数据,那么存储数据通常非常简单。模式设计、索引和存储引擎的许多复杂性都是由于想要支持某些查询和访问模式而导致的(参见第 3 章)。因此,通过将写入数据的形式与读取数据的形式分开,并允许多个不同的读取视图,您可以获得很大的灵活性。这个想法有时被称为命令查询责任分离(CQRS)[42, 58, 59]。

Storing data is normally quite straightforward if you don’t have to worry about how it is going to be queried and accessed; many of the complexities of schema design, indexing, and storage engines are the result of wanting to support certain query and access patterns (see Chapter 3). For this reason, you gain a lot of flexibility by separating the form in which data is written from the form it is read, and by allowing several different read views. This idea is sometimes known as command query responsibility segregation (CQRS) [42, 58, 59].

传统的数据库和模式设计方法基于这样的谬论:数据必须以与查询时相同的形式写入。如果您可以将数据从写优化的事件日志转换为读优化的应用程序状态,那么关于规范化和非规范化(请参阅“多对一和多对多关系”)的争论就变得基本上无关紧要:在读取优化的视图中对数据进行非规范化是完全合理的,因为转换过程为您提供了一种使其与事件日志保持一致的机制。

The traditional approach to database and schema design is based on the fallacy that data must be written in the same form as it will be queried. Debates about normalization and denormalization (see “Many-to-One and Many-to-Many Relationships”) become largely irrelevant if you can translate data from a write-optimized event log to read-optimized application state: it is entirely reasonable to denormalize data in the read-optimized views, as the translation process gives you a mechanism for keeping it consistent with the event log.
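A small sketch of this separation (the event schema is made up for illustration): one write-optimized event log, and two denormalized read views derived from it. Adding a third view later would simply mean replaying the same log again:

```python
# Write-optimized form: an append-only log of purchase events.
events = [
    ("alice", "bought", "book"),
    ("bob",   "bought", "book"),
    ("alice", "bought", "pen"),
]

def view_by_user(log):
    # Read-optimized view 1: purchases indexed by user.
    view = {}
    for user, _verb, item in log:
        view.setdefault(user, []).append(item)
    return view

def view_by_item(log):
    # Read-optimized view 2: buyers indexed by item (deliberately
    # denormalized -- the same facts appear in both views).
    view = {}
    for user, _verb, item in log:
        view.setdefault(item, []).append(user)
    return view
```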

在“描述负载”中,我们讨论了 Twitter 的主页时间线,这是特定用户关注的人最近编写的推文的缓存(如邮箱)。这是读取优化状态的另一个例子:主页时间线高度非规范化,因为你的推文在关注你的人的所有时间线中都是重复的。但是,扇出服务使这种重复状态与新推文和新关注关系保持同步,从而使重复易于管理。

In “Describing Load” we discussed Twitter’s home timelines, a cache of recently written tweets by the people a particular user is following (like a mailbox). This is another example of read-optimized state: home timelines are highly denormalized, since your tweets are duplicated in all of the timelines of the people following you. However, the fan-out service keeps this duplicated state in sync with new tweets and new following relationships, which keeps the duplication manageable.
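The fan-out-on-write idea can be sketched as follows (the follower data and function names are hypothetical, not Twitter's actual API):

```python
# Hypothetical data: who follows whom.
followers = {"alice": ["bob", "carol"]}

# Denormalized read-optimized state: each user's home timeline.
timelines = {}

def post_tweet(author, text):
    # Fan-out on write: the new tweet is duplicated into the timeline of
    # every follower, keeping the denormalized views in sync.
    for follower in followers.get(author, []):
        timelines.setdefault(follower, []).append((author, text))
```

Reads are then cheap (just fetch your own timeline), at the cost of extra work on every write.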

并发控制

Concurrency control

事件源和更改数据捕获的最大缺点是事件日志的使用者通常是异步的,因此用户可能会写入日志,然后从日志派生视图中读取并发现他们的写入尚未反映在读取视图中。我们之前在“阅读你自己的写作”中讨论过这个问题和潜在的解决方案。

The biggest downside of event sourcing and change data capture is that the consumers of the event log are usually asynchronous, so there is a possibility that a user may make a write to the log, then read from a log-derived view and find that their write has not yet been reflected in the read view. We discussed this problem and potential solutions previously in “Reading Your Own Writes”.

一种解决方案是同步执行读取视图的更新,并将事件附加到日志中。这需要一个事务将写入组合成一个原子单元,因此您要么需要将事件日志和读取视图保留在同一存储系统中,要么需要跨不同系统的分布式事务。或者,您可以使用“使用全序广播实现线性化存储”中讨论的方法 。

One solution would be to perform the updates of the read view synchronously with appending the event to the log. This requires a transaction to combine the writes into an atomic unit, so either you need to keep the event log and the read view in the same storage system, or you need a distributed transaction across the different systems. Alternatively, you could use the approach discussed in “Implementing linearizable storage using total order broadcast”.

另一方面,从事件日志中获取当前状态也简化了并发控制的某些方面。对多对象事务的大部分需求(请参阅 “单对象和多对象操作”)源于需要在多个不同位置更改数据的单个用户操作。通过事件溯源,您可以设计一个事件,使其成为用户操作的独立描述。然后,用户操作只需要在一个地方进行一次写入,即将事件附加到日志中,这很容易实现原子化。

On the other hand, deriving the current state from an event log also simplifies some aspects of concurrency control. Much of the need for multi-object transactions (see “Single-Object and Multi-Object Operations”) stems from a single user action requiring data to be changed in several different places. With event sourcing, you can design an event such that it is a self-contained description of a user action. The user action then requires only a single write in one place—namely appending the events to the log—which is easy to make atomic.

如果事件日志和应用程序状态以相同的方式分区(例如,处理分区 3 中某个客户的事件只需要更新应用程序状态的分区 3),那么简单的单线程日志消费者对写入不需要任何并发控制——通过构造,它一次仅处理一个事件(另请参见“实际串行执行”)。日志通过定义分区内事件的串行顺序来消除并发的不确定性 [24]。如果一个事件涉及多个状态分区,则需要做更多的工作,我们将在第 12 章中讨论。

If the event log and the application state are partitioned in the same way (for example, processing an event for a customer in partition 3 only requires updating partition 3 of the application state), then a straightforward single-threaded log consumer needs no concurrency control for writes—by construction, it only processes a single event at a time (see also “Actual Serial Execution”). The log removes the nondeterminism of concurrency by defining a serial order of events in a partition [24]. If an event touches multiple state partitions, a bit more work is required, which we will discuss in Chapter 12.
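A sketch of this co-partitioning (partition count and routing are illustrative): events are routed by customer ID, so each state partition is only ever touched by its own serial consumer and needs no locks:

```python
NUM_PARTITIONS = 4

def partition_of(customer_id):
    # The log and the state use the same partitioning function, so an event
    # for a customer only ever touches that customer's state partition.
    return hash(customer_id) % NUM_PARTITIONS

# One state dict per partition; each would be owned by one single-threaded consumer.
state = [dict() for _ in range(NUM_PARTITIONS)]

def consume(event):
    # Processed one event at a time per partition: no concurrency control
    # is needed for these writes, by construction.
    customer, amount = event
    p = partition_of(customer)
    state[p][customer] = state[p].get(customer, 0) + amount
```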

不变性的局限性

Limitations of immutability

许多不使用事件源模型的系统仍然依赖于不变性:各种数据库内部使用不可变数据结构或多版本数据来支持时间点快照(请参阅“索引和快照隔离”)。Git、Mercurial 和 Fossil 等版本控制系统也依赖不可变数据来保存文件的版本历史记录。

Many systems that don’t use an event-sourced model nevertheless rely on immutability: various databases internally use immutable data structures or multi-version data to support point-in-time snapshots (see “Indexes and snapshot isolation”). Version control systems such as Git, Mercurial, and Fossil also rely on immutable data to preserve version history of files.

永远保留所有变化的不可变历史在多大程度上是可行的?答案取决于数据集中的流失量。有些工作负载主要添加数据,很少更新或删除;它们很容易变得不可变。其他工作负载在相对较小的数据集上具有较高的更新和删除率;在这些情况下,不可变的历史记录可能会变得非常大,碎片可能会成为一个问题,而压缩和垃圾收集的性能对于操作的稳健性变得至关重要[ 60 , 61 ]。

To what extent is it feasible to keep an immutable history of all changes forever? The answer depends on the amount of churn in the dataset. Some workloads mostly add data and rarely update or delete; they are easy to make immutable. Other workloads have a high rate of updates and deletes on a comparatively small dataset; in these cases, the immutable history may grow prohibitively large, fragmentation may become an issue, and the performance of compaction and garbage collection becomes crucial for operational robustness [60, 61].

除了性能原因之外,在某些情况下,尽管数据具有不变性,但出于管理原因,您也可能需要删除数据。例如,隐私法规可能要求在用户关闭帐户后删除用户的个人信息,数据保护法规可能要求删除错误信息,或者可能需要遏制敏感信息的意外泄漏。

Besides the performance reasons, there may also be circumstances in which you need data to be deleted for administrative reasons, in spite of all immutability. For example, privacy regulations may require deleting a user’s personal information after they close their account, data protection legislation may require erroneous information to be removed, or an accidental leak of sensitive information may need to be contained.

在这些情况下,仅将另一个事件附加到日志中来指示应将之前的数据视为已删除是不够的,您实际上想要重写历史记录并假装数据从未被写入。例如,Datomic 将此功能称为切除 [ 62 ],而 Fossil 版本控制系统也有类似的概念,称为回避 [ 63 ]。

In these circumstances, it’s not sufficient to just append another event to the log to indicate that the prior data should be considered deleted—you actually want to rewrite history and pretend that the data was never written in the first place. For example, Datomic calls this feature excision [62], and the Fossil version control system has a similar concept called shunning [63].

真正删除数据出奇地困难 [64],因为副本可能存在于许多地方:例如,存储引擎、文件系统和 SSD 通常写入新位置而不是就地覆盖 [52],并且备份通常被故意设为不可变,以防止意外删除或损坏。删除更多的是“使检索数据变得更加困难”,而不是真正“使检索数据变得不可能”。然而,有时你必须尝试,正如我们将在“立法与自律”中看到的那样。

Truly deleting data is surprisingly hard [64], since copies can live in many places: for example, storage engines, filesystems, and SSDs often write to a new location rather than overwriting in place [52], and backups are often deliberately immutable to prevent accidental deletion or corruption. Deletion is more a matter of “making it harder to retrieve the data” than actually “making it impossible to retrieve the data.” Nevertheless, you sometimes have to try, as we shall see in “Legislation and self-regulation”.

处理流

Processing Streams

到目前为止,在本章中,我们已经讨论了流的来源(用户活动事件、传感器和对数据库的写入),并且讨论了流的传输方式(通过直接消息传递、消息代理和事件日志)。

So far in this chapter we have talked about where streams come from (user activity events, sensors, and writes to databases), and we have talked about how streams are transported (through direct messaging, via message brokers, and in event logs).

剩下的就是讨论一旦拥有流就可以用它做什么,即可以处理它。总的来说,有以下三种选择:

What remains is to discuss what you can do with the stream once you have it—namely, you can process it. Broadly, there are three options:

  1. 您可以获取事件中的数据并将其写入数据库、缓存、搜索索引或类似的存储系统,然后其他客户端可以从中查询数据。如图 11-5 所示,这是使数据库与系统其他部分发生的更改保持同步的好方法——特别是当流消费者是写入数据库的唯一客户端时。写入存储系统是我们在“批处理工作流的输出”中所讨论内容的流式等价物。

  1. You can take the data in the events and write it to a database, cache, search index, or similar storage system, from where it can then be queried by other clients. As shown in Figure 11-5, this is a good way of keeping a database in sync with changes happening in other parts of the system—especially if the stream consumer is the only client writing to the database. Writing to a storage system is the streaming equivalent of what we discussed in “The Output of Batch Workflows”.

  2. 您可以通过某种方式将事件推送给用户,例如通过发送电子邮件警报或推送通知,或者将事件流式传输到实时仪表板并在其中进行可视化。在这种情况下,人类是流的最终消费者。

  2. You can push the events to users in some way, for example by sending email alerts or push notifications, or by streaming the events to a real-time dashboard where they are visualized. In this case, a human is the ultimate consumer of the stream.

  3. 您可以处理一个或多个输入流以生成一个或多个输出流。流在最终到达输出(选项 1 或 2)之前,可能会经过由多个此类处理阶段组成的管道。

  3. You can process one or more input streams to produce one or more output streams. Streams may go through a pipeline consisting of several such processing stages before they eventually end up at an output (option 1 or 2).

在本章的其余部分,我们将讨论选项 3:处理流以生成其他派生流。处理这种流的一段代码称为操作符(operator)或作业(job)。它与我们在第 10 章中讨论的 Unix 进程和 MapReduce 作业密切相关,数据流的模式也类似:流处理器以只读方式消费输入流,并以仅附加的方式将其输出写入不同的位置。

In the rest of this chapter, we will discuss option 3: processing streams to produce other, derived streams. A piece of code that processes streams like this is known as an operator or a job. It is closely related to the Unix processes and MapReduce jobs we discussed in Chapter 10, and the pattern of dataflow is similar: a stream processor consumes input streams in a read-only fashion and writes its output to a different location in an append-only fashion.

流处理器中的分区和并行化模式也与我们在第 10 章 中看到的 MapReduce 和数据流引擎中的模式非常相似,因此我们不会在这里重复这些主题。基本映射操作(例如转换和过滤记录)的工作方式也是相同的。

The patterns for partitioning and parallelization in stream processors are also very similar to those in MapReduce and the dataflow engines we saw in Chapter 10, so we won’t repeat those topics here. Basic mapping operations such as transforming and filtering records also work the same.

与批处理作业的一个关键区别是流永远不会结束。这种差异有很多含义:正如本章开头所讨论的,排序对于无界数据集没有意义,因此不能使用排序合并连接(请参阅“Reduce 端连接与分组”)。容错机制也必须改变:对于已经运行了几分钟的批处理作业,失败的任务可以简单地从头重新启动,但对于已经运行了几年的流作业,崩溃后从头重新启动可能不是一个可行的选择。

The one crucial difference to batch jobs is that a stream never ends. This difference has many implications: as discussed at the start of this chapter, sorting does not make sense with an unbounded dataset, and so sort-merge joins (see “Reduce-Side Joins and Grouping”) cannot be used. Fault-tolerance mechanisms must also change: with a batch job that has been running for a few minutes, a failed task can simply be restarted from the beginning, but with a stream job that has been running for several years, restarting from the beginning after a crash may not be a viable option.

流处理的用途

Uses of Stream Processing

流处理长期以来一直用于监控目的,组织希望在发生某些事情时收到警报。例如:

Stream processing has long been used for monitoring purposes, where an organization wants to be alerted if certain things happen. For example:

  • 欺诈检测系统需要确定信用卡的使用模式是否意外改变,并在卡可能被盗时阻止该卡。

  • Fraud detection systems need to determine if the usage patterns of a credit card have unexpectedly changed, and block the card if it is likely to have been stolen.

  • 交易系统需要检查金融市场的价格变化并根据指定的规则执行交易。

  • Trading systems need to examine price changes in a financial market and execute trades according to specified rules.

  • 制造系统需要监控工厂中机器的状态,并在出现故障时快速识别问题。

  • Manufacturing systems need to monitor the status of machines in a factory, and quickly identify the problem if there is a malfunction.

  • 军事和情报系统需要跟踪潜在攻击者的活动,并在出现攻击迹象时发出警报。

  • Military and intelligence systems need to track the activities of a potential aggressor, and raise the alarm if there are signs of an attack.

此类应用需要相当复杂的模式匹配和关联。然而,随着时间的推移,流处理的其他用途也出现了。在本节中,我们将简要比较和对比其中一些应用程序。

These kinds of applications require quite sophisticated pattern matching and correlations. However, other uses of stream processing have also emerged over time. In this section we will briefly compare and contrast some of these applications.

复杂事件处理

Complex event processing

复杂事件处理(CEP) 是 20 世纪 90 年代开发的一种用于分析事件流的方法,特别适合需要搜索某些事件模式的应用程序 [ 65 , 66 ]。与正则表达式允许您搜索字符串中的某些字符模式的方式类似,CEP 允许您指定规则来搜索流中的某些事件模式。

Complex event processing (CEP) is an approach developed in the 1990s for analyzing event streams, especially geared toward the kind of application that requires searching for certain event patterns [65, 66]. Similarly to the way that a regular expression allows you to search for certain patterns of characters in a string, CEP allows you to specify rules to search for certain patterns of events in a stream.

CEP 系统通常使用高级声明性查询语言(例如 SQL)或图形用户界面来描述应检测的事件模式。这些查询被提交给处理引擎,该处理引擎使用输入流并在内部维护执行所需匹配的状态机。当找到匹配时,引擎会发出一个复杂事件(因此得名),其中包含检测到的事件模式的详细信息[ 67 ]。

CEP systems often use a high-level declarative query language like SQL, or a graphical user interface, to describe the patterns of events that should be detected. These queries are submitted to a processing engine that consumes the input streams and internally maintains a state machine that performs the required matching. When a match is found, the engine emits a complex event (hence the name) with the details of the event pattern that was detected [67].

在这些系统中,查询和数据之间的关系与普通数据库相比是相反的。通常,数据库会持久存储数据并将查询视为瞬态:当查询进入时,数据库会搜索与查询匹配的数据,然后在查询完成后忘记该查询。CEP 引擎颠倒了这些角色:查询被长期存储,来自输入流的事件不断流过它们以搜索与事件模式匹配的查询[68 ]

In these systems, the relationship between queries and data is reversed compared to normal databases. Usually, a database stores data persistently and treats queries as transient: when a query comes in, the database searches for data matching the query, and then forgets about the query when it has finished. CEP engines reverse these roles: queries are stored long-term, and events from the input streams continuously flow past them in search of a query that matches an event pattern [68].

CEP 的实现包括 Esper [ 69 ]、IBM InfoSphere Streams [ 70 ]、Apama、TIBCO StreamBase 和 SQLstream。像 Samza 这样的分布式流处理器也获得了对流上声明性查询的 SQL 支持 [ 71 ]。

Implementations of CEP include Esper [69], IBM InfoSphere Streams [70], Apama, TIBCO StreamBase, and SQLstream. Distributed stream processors like Samza are also gaining SQL support for declarative queries on streams [71].

流分析

Stream analytics

使用流处理的另一个领域是流分析。CEP 和流分析之间的界限很模糊,但作为一般规则,分析往往对查找特定事件序列不太感兴趣,而更注重大量事件的聚合和统计指标,例如:

Another area in which stream processing is used is for analytics on streams. The boundary between CEP and stream analytics is blurry, but as a general rule, analytics tends to be less interested in finding specific event sequences and is more oriented toward aggregations and statistical metrics over a large number of events—for example:

  • 测量某种类型事件的发生率(每个时间间隔发生的频率)

  • Measuring the rate of some type of event (how often it occurs per time interval)

  • 计算某个时间段内的值的滚动平均值

  • Calculating the rolling average of a value over some time period

  • 将当前统计数据与之前的时间间隔进行比较(例如,检测趋势或对与上周同一时间相比异常高或低的指标发出警报)

  • Comparing current statistics to previous time intervals (e.g., to detect trends or to alert on metrics that are unusually high or low compared to the same time last week)

此类统计信息通常是在固定的时间间隔内计算的 - 例如,您可能想知道过去 5 分钟内每秒对服务的平均查询数,以及该时间段内的第 99 个百分位响应时间。对几分钟进行平均可以消除每一秒之间不相关的波动,同时仍然可以让您及时了解流量模式的任何变化。聚合的时间间隔称为窗口,我们将在“关于时间的推理”中更详细地研究窗口。

Such statistics are usually computed over fixed time intervals—for example, you might want to know the average number of queries per second to a service over the last 5 minutes, and their 99th percentile response time during that period. Averaging over a few minutes smoothes out irrelevant fluctuations from one second to the next, while still giving you a timely picture of any changes in traffic pattern. The time interval over which you aggregate is known as a window, and we will look into windowing in more detail in “Reasoning About Time”.
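A fixed (tumbling) window aggregation can be sketched as follows, assuming each event carries a numeric timestamp and a value:

```python
def tumbling_window_avg(events, window_size):
    # events: iterable of (timestamp, value) pairs.
    # Group values into fixed, non-overlapping windows by timestamp,
    # then average each window.
    windows = {}
    for ts, value in events:
        start = (ts // window_size) * window_size  # window the event falls into
        windows.setdefault(start, []).append(value)
    return {start: sum(vs) / len(vs) for start, vs in windows.items()}
```

For example, with a window size of 5 time units, events at t=1 and t=2 land in the [0, 5) window and an event at t=6 lands in [5, 10).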

流分析系统有时使用概率算法,例如用于集合成员资格的布隆过滤器(我们在“性能优化”中遇到)、用于基数估计的 HyperLogLog [ 72 ] 以及各种百分位数估计算法(参见“实践中的百分位数”)。概率算法产生近似结果,但具有比精确算法所需的流处理器内存少得多的优点。这种近似算法的使用有时会让人们相信流处理系统总是有损和不精确的,但这是错误的:流处理本质上没有什么近似性,概率算法仅仅是一种优化[73 ]

Stream analytics systems sometimes use probabilistic algorithms, such as Bloom filters (which we encountered in “Performance optimizations”) for set membership, HyperLogLog [72] for cardinality estimation, and various percentile estimation algorithms (see “Percentiles in Practice”). Probabilistic algorithms produce approximate results, but have the advantage of requiring significantly less memory in the stream processor than exact algorithms. This use of approximation algorithms sometimes leads people to believe that stream processing systems are always lossy and inexact, but that is wrong: there is nothing inherently approximate about stream processing, and probabilistic algorithms are merely an optimization [73].
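As a toy illustration of such a probabilistic structure (far smaller and simpler than production implementations), a Bloom filter answers set-membership queries in fixed memory, with possible false positives but no false negatives:

```python
import hashlib

class BloomFilter:
    # Tiny Bloom filter: approximate set membership in a fixed-size bit array.
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions per item from a salted hash.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))
```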

许多开源分布式流处理框架在设计时都考虑到了分析:例如 Apache Storm、Spark Streaming、Flink、Concord、Samza 和 Kafka Streams [ 74 ]。托管服务包括 Google Cloud Dataflow 和 Azure Stream Analytics。

Many open source distributed stream processing frameworks are designed with analytics in mind: for example, Apache Storm, Spark Streaming, Flink, Concord, Samza, and Kafka Streams [74]. Hosted services include Google Cloud Dataflow and Azure Stream Analytics.

维护物化视图

Maintaining materialized views

我们在“数据库和流”中看到,对数据库的更改流可用于保持派生数据系统(例如缓存、搜索索引和数据仓库)与源数据库保持同步。我们可以将这些示例视为维护物化视图的特定案例(请参阅 “聚合:数据立方体和物化视图”):在某些数据集上派生出替代视图,以便您可以有效地查询它,并在底层数据发生变化时更新该视图[ 50 ]。

We saw in “Databases and Streams” that a stream of changes to a database can be used to keep derived data systems, such as caches, search indexes, and data warehouses, up to date with a source database. We can regard these examples as specific cases of maintaining materialized views (see “Aggregation: Data Cubes and Materialized Views”): deriving an alternative view onto some dataset so that you can query it efficiently, and updating that view whenever the underlying data changes [50].

类似地,在事件溯源中,应用程序状态是通过应用事件日志来维护的;这里的应用程序状态也是一种物化视图。与流分析场景不同,仅考虑某个时间窗口内的事件通常是不够的:构建物化视图可能需要任意时间段内的所有事件,除了可能被日志压缩丢弃的任何过时事件(请参阅“日志压缩”)。实际上,您需要一个一直延伸到时间起点的窗口。

Similarly, in event sourcing, application state is maintained by applying a log of events; here the application state is also a kind of materialized view. Unlike stream analytics scenarios, it is usually not sufficient to consider only events within some time window: building the materialized view potentially requires all events over an arbitrary time period, apart from any obsolete events that may be discarded by log compaction (see “Log compaction”). In effect, you need a window that stretches all the way back to the beginning of time.

原则上,任何流处理器都可以用于物化视图维护,尽管永远维护事件的需要与一些主要在有限持续时间的窗口上运行的面向分析的框架的假设背道而驰。Samza 和 Kafka Streams 基于 Kafka 对日志压缩的支持而支持这种用法 [ 75 ]。

In principle, any stream processor could be used for materialized view maintenance, although the need to maintain events forever runs counter to the assumptions of some analytics-oriented frameworks that mostly operate on windows of a limited duration. Samza and Kafka Streams support this kind of usage, building upon Kafka’s support for log compaction [75].

在流中搜索

Search on streams

除了允许搜索由多个事件组成的模式的 CEP 之外,有时还需要基于复杂的条件搜索单个事件,例如全文搜索查询。

Besides CEP, which allows searching for patterns consisting of multiple events, there is also sometimes a need to search for individual events based on complex criteria, such as full-text search queries.

例如,媒体监控服务订阅新闻文章和媒体广播的提要,并搜索任何提及感兴趣的公司、产品或主题的新闻。这是通过提前制定搜索查询,然后不断地将新闻项流与该查询进行匹配来完成的。一些网站上也存在类似的功能:例如,房地产网站的用户可以要求在市场上出现符合其搜索条件的新房产时收到通知。Elasticsearch [ 76 ]的渗透器功能是实现此类流搜索的一种选择。

For example, media monitoring services subscribe to feeds of news articles and broadcasts from media outlets, and search for any news mentioning companies, products, or topics of interest. This is done by formulating a search query in advance, and then continually matching the stream of news items against this query. Similar features exist on some websites: for example, users of real estate websites can ask to be notified when a new property matching their search criteria appears on the market. The percolator feature of Elasticsearch [76] is one option for implementing this kind of stream search.

传统的搜索引擎首先对文档建立索引,然后对索引运行查询。相比之下,流上的搜索则将这一处理过程颠倒过来:查询被存储起来,而文档则流经这些查询,就像在 CEP 中一样。在最简单的情况下,您可以针对每个查询测试每个文档,尽管如果查询数量很大,这可能会变得很慢。为了优化该过程,可以对查询和文档都建立索引,从而缩小可能匹配的查询集 [77]。

Conventional search engines first index the documents and then run queries over the index. By contrast, searching a stream turns the processing on its head: the queries are stored, and the documents run past the queries, like in CEP. In the simplest case, you can test every document against every query, although this can get slow if you have a large number of queries. To optimize the process, it is possible to index the queries as well as the documents, and thus narrow down the set of queries that may match [77].
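A percolator-style sketch of this inversion (the queries and term model are simplified to bags of words, which is much cruder than a real full-text engine): stored queries are indexed by term, each incoming document looks up only the candidate queries that share a term with it, and then checks which of those fully match:

```python
# Stored queries, each a set of required terms (hypothetical examples).
queries = {
    "q1": {"housing", "berlin"},
    "q2": {"housing", "munich"},
}

# Index over the *queries*: term -> ids of queries containing that term.
term_index = {}
for qid, terms in queries.items():
    for term in terms:
        term_index.setdefault(term, set()).add(qid)

def match(document_words):
    # Each document runs past the stored queries; the term index narrows
    # down which queries could possibly match before the full check.
    words = set(document_words)
    candidates = set()
    for w in words:
        candidates |= term_index.get(w, set())
    return {qid for qid in candidates if queries[qid] <= words}
```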

消息传递和 RPC

Message passing and RPC

“消息传递数据流”中,我们讨论了消息传递系统作为 RPC 的替代方案,即作为服务通信的机制,例如在参与者模型中使用的机制。尽管这些系统也是基于消息和事件的,但我们通常不将它们视为流处理器:

In “Message-Passing Dataflow” we discussed message-passing systems as an alternative to RPC—i.e., as a mechanism for services to communicate, as used for example in the actor model. Although these systems are also based on messages and events, we normally don’t think of them as stream processors:

  • Actor 框架主要是一种管理通信模块的并发和分布式执行的机制,而流处理主要是一种数据管理技术。

  • Actor frameworks are primarily a mechanism for managing concurrency and distributed execution of communicating modules, whereas stream processing is primarily a data management technique.

  • 参与者之间的通信通常是短暂的、一对一的,而事件日志是持久的、多订阅者的。

  • Communication between actors is often ephemeral and one-to-one, whereas event logs are durable and multi-subscriber.

  • 参与者可以以任意方式进行通信(包括循环请求/响应模式),但流处理器通常设置在非循环管道中,其中每个流都是一个特定作业的输出,并且源自一组明确定义的输入流。

  • Actors can communicate in arbitrary ways (including cyclic request/response patterns), but stream processors are usually set up in acyclic pipelines where every stream is the output of one particular job, and derived from a well-defined set of input streams.

也就是说,类 RPC 系统和流处理之间存在一些交叉领域。例如,Apache Storm 有一个称为分布式 RPC 的功能,该功能允许将用户查询外包给一组也处理事件流的节点;然后,这些查询与来自输入流的事件交织在一起,并且可以聚合结果并将其发送回用户[ 78 ]。(另请参见“多分区数据处理”。)

That said, there is some crossover area between RPC-like systems and stream processing. For example, Apache Storm has a feature called distributed RPC, which allows user queries to be farmed out to a set of nodes that also process event streams; these queries are then interleaved with events from the input streams, and results can be aggregated and sent back to the user [78]. (See also “Multi-partition data processing”.)

还可以使用参与者框架来处理流。但是,许多此类框架不保证崩溃时的消息传递,因此除非您实现额外的重试逻辑,否则处理不是容错的。

It is also possible to process streams using actor frameworks. However, many such frameworks do not guarantee message delivery in the case of crashes, so the processing is not fault-tolerant unless you implement additional retry logic.

关于时间的推理

Reasoning About Time

流处理器通常需要处理时间,特别是在用于分析目的时,经常使用时间窗口,例如“过去五分钟的平均值”。看起来“最后五分钟”的含义应该是明确且明确的,但不幸的是,这个概念出奇地棘手。

Stream processors often need to deal with time, especially when used for analytics purposes, which frequently use time windows such as “the average over the last five minutes.” It might seem that the meaning of “the last five minutes” should be unambiguous and clear, but unfortunately the notion is surprisingly tricky.

在批处理过程中,处理任务快速处理大量历史事件。如果需要按时间进行某种细分,则批处理需要查看每个事件中嵌入的时间戳。查看运行批处理进程的计算机的系统时钟是没有意义的,因为该进程运行的时间与事件实际发生的时间无关。

In a batch process, the processing tasks rapidly crunch through a large collection of historical events. If some kind of breakdown by time needs to happen, the batch process needs to look at the timestamp embedded in each event. There is no point in looking at the system clock of the machine running the batch process, because the time at which the process is run has nothing to do with the time at which the events actually occurred.

批处理可以在几分钟内读取一年的历史事件;在大多数情况下,感兴趣的时间线是历史的一年,而不是处理的几分钟。此外,在事件中使用时间戳可以使处理具有确定性:在相同的输入上再次运行相同的过程会产生相同的结果(请参阅“容错”)。

A batch process may read a year’s worth of historical events within a few minutes; in most cases, the timeline of interest is the year of history, not the few minutes of processing. Moreover, using the timestamps in the events allows the processing to be deterministic: running the same process again on the same input yields the same result (see “Fault tolerance”).

另一方面,许多流处理框架使用处理机器上的本地系统时钟(处理时间)来确定窗口 [79]。这种方法的优点是简单,如果事件创建和事件处理之间的延迟可以忽略不计,那么它是合理的。然而,如果存在任何明显的处理滞后,即处理可能明显晚于事件实际发生的时间,那么这种方法就会失效。

On the other hand, many stream processing frameworks use the local system clock on the processing machine (the processing time) to determine windowing [79]. This approach has the advantage of being simple, and it is reasonable if the delay between event creation and event processing is negligibly short. However, it breaks down if there is any significant processing lag—i.e., if the processing may happen noticeably later than the time at which the event actually occurred.

事件时间与处理时间

Event time versus processing time

处理延迟的原因有很多:排队、网络故障(请参阅“不可靠的网络”)、导致消息代理或处理器中出现争用的性能问题、流消费者的重新启动,或者在从故障中恢复或修复代码错误之后对过去事件的重新处理(请参阅“重播旧消息”)。

There are many reasons why processing may be delayed: queueing, network faults (see “Unreliable Networks”), a performance issue leading to contention in the message broker or processor, a restart of the stream consumer, or reprocessing of past events (see “Replaying old messages”) while recovering from a fault or after fixing a bug in the code.

此外,消息延迟还会导致不可预测的消息排序。例如,假设用户首先发出一个 Web 请求(由 Web 服务器 A 处理),然后发出第二个请求(由服务器 B 处理)。A 和 B 发出描述它们处理的请求的事件,但 B 的事件先于 A 的事件到达消息代理。现在,流处理器将首先看到 B 事件,然后看到 A 事件,即使它们实际上以相反的顺序发生。

Moreover, message delays can also lead to unpredictable ordering of messages. For example, say a user first makes one web request (which is handled by web server A), and then a second request (which is handled by server B). A and B emit events describing the requests they handled, but B’s event reaches the message broker before A’s event does. Now stream processors will first see the B event and then the A event, even though they actually occurred in the opposite order.

如果打个类比有助于理解,请考虑《星球大战》系列电影:第四集于 1977 年上映,第五集于 1980 年,第六集于 1983 年,随后第一、二、三集分别于 1999 年、2002 年和 2005 年上映,第七集则于 2015 年上映 [80]。ii 如果您按照电影上映的顺序观看,那么您处理电影的顺序与其叙事顺序并不一致。(剧集编号就像事件时间戳,而您观看电影的日期就是处理时间。)作为人类,我们能够应对这种不连续性,但流处理算法需要经过专门编写才能适应这类时序和排序问题。

If it helps to have an analogy, consider the Star Wars movies: Episode IV was released in 1977, Episode V in 1980, and Episode VI in 1983, followed by Episodes I, II, and III in 1999, 2002, and 2005, respectively, and Episode VII in 2015 [80].ii If you watched the movies in the order they came out, the order in which you processed the movies is inconsistent with the order of their narrative. (The episode number is like the event timestamp, and the date when you watched the movie is the processing time.) As humans, we are able to cope with such discontinuities, but stream processing algorithms need to be specifically written to accommodate such timing and ordering issues.

混淆事件时间和处理时间会导致糟糕的数据。例如,假设您有一个流处理器来测量请求速率(计算每秒的请求数)。如果您重新部署流处理器,它可能会关闭一分钟,并在恢复后处理积压的事件。如果您根据处理时间来测量速率,那么在处理积压期间,看起来就好像出现了突然的异常请求峰值,而实际上请求的真实速率是稳定的(图 11-7)。

Confusing event time and processing time leads to bad data. For example, say you have a stream processor that measures the rate of requests (counting the number of requests per second). If you redeploy the stream processor, it may be shut down for a minute and process the backlog of events when it comes back up. If you measure the rate based on the processing time, it will look as if there was a sudden anomalous spike of requests while processing the backlog, when in fact the real rate of requests was steady (Figure 11-7).

图 11-7。由于处理速率的变化,按处理时间加窗会引入伪影。

知道何时准备就绪

Knowing when you’re ready

根据事件时间定义窗口时的一个棘手问题是,您永远无法确定何时收到特定窗口的所有事件,或者是否还有一些事件即将发生。

A tricky problem when defining windows in terms of event time is that you can never be sure when you have received all of the events for a particular window, or whether there are some events still to come.

例如,假设您将事件分组到一分钟的窗口中,以便可以计算每分钟的请求数。您已经计算了一些时间戳落在该小时第 37 分钟的事件数量,并且时间已经过去;现在,大多数传入事件都发生在每小时的第 38 和 39 分钟内。你什么时候声明你已经完成了第37分钟的窗口,并输出它的计数器值?

For example, say you’re grouping events into one-minute windows so that you can count the number of requests per minute. You have counted some number of events with timestamps that fall in the 37th minute of the hour, and time has moved on; now most of the incoming events fall within the 38th and 39th minutes of the hour. When do you declare that you have finished the window for the 37th minute, and output its counter value?

在一段时间内没有看到任何新事件后,您可以超时并声明窗口准备就绪,但仍可能发生这样的情况:某些事件被缓冲在另一台机器上的某处,并由于网络中断而延迟到达。您需要能够处理这类在窗口已被声明完成之后才到达的掉队(straggler)事件。大体上,您有两种选择 [1]:

You can time out and declare a window ready after you have not seen any new events for a while, but it could still happen that some events were buffered on another machine somewhere, delayed due to a network interruption. You need to be able to handle such straggler events that arrive after the window has already been declared complete. Broadly, you have two options [1]:

  1. 忽略掉队事件,因为它们可能只占正常情况下事件的一小部分。您可以跟踪丢弃的事件数量作为指标,并在开始丢弃大量数据时发出警报。

  1. Ignore the straggler events, as they are probably a small percentage of events in normal circumstances. You can track the number of dropped events as a metric, and alert if you start dropping a significant amount of data.

  2. 发布修正,即包含掉队者在内的窗口的更新值。您可能还需要撤回以前的输出。

  2. Publish a correction, an updated value for the window with stragglers included. You may also need to retract the previous output.

在某些情况下,可以使用特殊消息来指示“从现在开始,将不再有时间戳早于t 的消息”,消费者可以使用它来触发窗口[ 81 ]。但是,如果不同机器上的多个生产者正在生成事件,每个事件都有自己的最小时间戳阈值,则消费者需要单独跟踪每个生产者。在这种情况下,添加和删除生产者会更加棘手。

In some cases it is possible to use a special message to indicate, “From now on there will be no more messages with a timestamp earlier than t,” which can be used by consumers to trigger windows [81]. However, if several producers on different machines are generating events, each with their own minimum timestamp thresholds, the consumers need to keep track of each producer individually. Adding and removing producers is trickier in this case.
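作为示意,下面的小段代码演示消费者如何跟踪多个生产者各自的水位线(“从现在起不会再有早于 t 的消息”),并只在所有生产者的最小水位线之前安全地关闭窗口。函数名与变量名均为本示例假设,并非任何真实框架的 API。

As an illustrative sketch (names are assumptions, not any real framework's API), a consumer might track each producer's watermark separately and only act on the minimum across all producers:

```python
# Sketch: tracking per-producer "no message earlier than t" watermarks on
# the consumer side; only the minimum across producers is safe to act on.
producer_watermarks = {}  # producer_id -> latest promised minimum timestamp

def on_watermark(producer_id, t):
    producer_watermarks[producer_id] = t
    # Windows ending before this value can safely be declared complete,
    # since no known producer will send an earlier-timestamped event.
    return min(producer_watermarks.values())
```

请注意,正如正文所说,添加或移除生产者需要相应地更新这个字典,这正是该方案棘手之处。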

无论如何,你用的是谁的时钟?

Whose clock are you using, anyway?

当事件可以在系统中的多个点进行缓冲时,为事件分配时间戳就更加困难。例如,考虑一个向服务器报告使用指标事件的移动应用程序。该应用程序可以在设备离线时使用,在这种情况下,它将在设备本地缓冲事件,并在下次互联网连接可用时(可能是几小时甚至几天后)将它们发送到服务器。对于该流的任何消费者来说,事件将显示为极度延迟的落后者。

Assigning timestamps to events is even more difficult when events can be buffered at several points in the system. For example, consider a mobile app that reports events for usage metrics to a server. The app may be used while the device is offline, in which case it will buffer events locally on the device and send them to a server when an internet connection is next available (which may be hours or even days later). To any consumers of this stream, the events will appear as extremely delayed stragglers.

在这种情况下,根据移动设备的本地时钟,事件的时间戳实际上应该是用户交互发生的时间。然而,用户控制设备上的时钟通常不可信,因为它可能会被意外或故意设置为错误的时间(请参阅 “时钟同步和准确性”)。服务器接收事件的时间(根据服务器的时钟)更有可能是准确的,因为服务器在您的控制之下,但在描述用户交互方面意义不大。

In this context, the timestamp on the events should really be the time at which the user interaction occurred, according to the mobile device’s local clock. However, the clock on a user-controlled device often cannot be trusted, as it may be accidentally or deliberately set to the wrong time (see “Clock Synchronization and Accuracy”). The time at which the event was received by the server (according to the server’s clock) is more likely to be accurate, since the server is under your control, but less meaningful in terms of describing the user interaction.

要调整不正确的设备时钟,一种方法是记录三个时间戳 [ 82 ]:

To adjust for incorrect device clocks, one approach is to log three timestamps [82]:

  • 根据设备时钟,事件发生的时间

  • The time at which the event occurred, according to the device clock

  • 根据设备时钟将事件发送到服务器的时间

  • The time at which the event was sent to the server, according to the device clock

  • 服务器收到事件的时间(根据服务器时钟)

  • The time at which the event was received by the server, according to the server clock

通过从第三个时间戳中减去第二个时间戳,您可以估计设备时钟和服务器时钟之间的偏移(假设与所需的时间戳精度相比,网络延迟可以忽略不计)。然后,您可以将该偏移量应用于事件时间戳,从而估计事件实际发生的真实时间(假设设备时钟偏移量在事件发生的时间和发送到服务器的时间之间没有变化)。

By subtracting the second timestamp from the third, you can estimate the offset between the device clock and the server clock (assuming the network delay is negligible compared to the required timestamp accuracy). You can then apply that offset to the event timestamp, and thus estimate the true time at which the event actually occurred (assuming the device clock offset did not change between the time the event occurred and the time it was sent to the server).
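上述换算可以用几行示意性代码表达(函数名为本示例假设;假定网络延迟可忽略,且设备时钟偏移在此期间保持不变):

The arithmetic above can be sketched in a few lines (the function name is an assumption for illustration; it presumes negligible network delay and a constant device clock offset):

```python
# Sketch: correcting an untrusted device clock using the three logged
# timestamps described in the text.
def adjust_event_time(device_event_time, device_send_time, server_receive_time):
    # Estimated offset between the device clock and the server clock
    offset = server_receive_time - device_send_time
    # Apply the offset to estimate when the event really occurred
    return device_event_time + offset

# Example: a device clock running 100 seconds behind the server clock
print(adjust_event_time(1000, 1010, 1110))  # → 1100
```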

这个问题并非流处理所独有,批处理在时间推理上也面临完全相同的问题。只是在流处理环境中它更加明显,因为我们对时间的流逝更加敏感。

This problem is not unique to stream processing—batch processing suffers from exactly the same issues of reasoning about time. It is just more noticeable in a streaming context, where we are more aware of the passage of time.

窗口的类型

Types of windows

一旦您知道如何确定事件的时间戳,下一步就是决定如何定义时间段内的窗口。然后,该窗口可用于聚合,例如对事件进行计数,或计算窗口内值的平均值。常用的窗口有几种类型 [ 79 , 83 ]:

Once you know how the timestamp of an event should be determined, the next step is to decide how windows over time periods should be defined. The window can then be used for aggregations, for example to count events, or to calculate the average of values within the window. Several types of windows are in common use [79, 83]:

滚动窗口
Tumbling window

滚动窗口有固定的长度,每个事件都恰好属于一个窗口。例如,如果您有一个 1 分钟的滚动窗口,则时间戳在 10:03:00 到 10:03:59 之间的所有事件被分组到一个窗口中,10:04:00 到 10:04:59 之间的事件被分组到下一个窗口,依此类推。您可以通过获取每个事件的时间戳并将其向下舍入到最接近的分钟来实现 1 分钟的滚动窗口,从而确定它所属的窗口。

A tumbling window has a fixed length, and every event belongs to exactly one window. For example, if you have a 1-minute tumbling window, all the events with timestamps between 10:03:00 and 10:03:59 are grouped into one window, events between 10:04:00 and 10:04:59 into the next window, and so on. You could implement a 1-minute tumbling window by taking each event timestamp and rounding it down to the nearest minute to determine the window that it belongs to.
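这种“向下舍入”的实现可以用几行代码示意(名称为本示例假设,并非任何框架的 API):

The rounding-down implementation can be sketched as follows (names are illustrative, not any framework's API):

```python
from collections import Counter

WINDOW_SIZE = 60  # 1-minute tumbling windows, expressed in seconds

def window_start(timestamp):
    # Round the timestamp down to the nearest minute; every event
    # therefore belongs to exactly one window.
    return timestamp - (timestamp % WINDOW_SIZE)

def count_per_window(timestamps):
    # Count events per tumbling window (timestamps in seconds since midnight)
    return Counter(window_start(t) for t in timestamps)

# 10:03:15 and 10:03:59 share a window; 10:04:00 opens the next one
print(count_per_window([36195, 36239, 36240]))
```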

跳跃窗口
Hopping window

跳跃窗口也具有固定长度,但允许窗口重叠以提供一定的平滑效果。例如,跳跃大小为 1 分钟的 5 分钟窗口将包含 10:03:00 到 10:07:59 之间的事件,下一个窗口则覆盖 10:04:00 到 10:08:59 之间的事件,依此类推。您可以通过首先计算 1 分钟的滚动窗口,然后聚合多个相邻窗口来实现这种跳跃窗口。

A hopping window also has a fixed length, but allows windows to overlap in order to provide some smoothing. For example, a 5-minute window with a hop size of 1 minute would contain the events between 10:03:00 and 10:07:59, then the next window would cover events between 10:04:00 and 10:08:59, and so on. You can implement this hopping window by first calculating 1-minute tumbling windows, and then aggregating over several adjacent windows.
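“聚合多个相邻滚动窗口”这一步可以这样示意(函数名与数据结构均为本示例假设):

The "aggregate several adjacent tumbling windows" step might look like this (the function name and data shape are assumptions for illustration):

```python
# Sketch: 5-minute hopping windows with a 1-minute hop, computed by
# summing the counts of adjacent 1-minute tumbling windows.
def hopping_counts(tumbling_counts, window_minutes=5):
    # tumbling_counts: minute at which a tumbling window starts -> event count
    return {
        start: sum(tumbling_counts.get(start + m, 0)
                   for m in range(window_minutes))
        for start in tumbling_counts
    }
```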

滑动窗口
Sliding window

滑动窗口包含彼此之间在某个时间间隔内发生的所有事件。例如,5 分钟的滑动窗口将覆盖 10:03:39 和 10:08:12 的事件,因为它们相隔不到 5 分钟(请注意,5 分钟的滚动窗口和跳跃窗口不会把这两个事件放在同一个窗口中,因为它们使用固定边界)。滑动窗口可以通过保留一个按时间排序的事件缓冲区,并在旧事件从窗口中过期时将其移除来实现。

A sliding window contains all the events that occur within some interval of each other. For example, a 5-minute sliding window would cover events at 10:03:39 and 10:08:12, because they are less than 5 minutes apart (note that tumbling and hopping 5-minute windows would not have put these two events in the same window, as they use fixed boundaries). A sliding window can be implemented by keeping a buffer of events sorted by time and removing old events when they expire from the window.
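正文描述的“有序缓冲区 + 过期剔除”做法可以这样示意(类名为本示例假设):

The sorted-buffer-with-eviction approach described above can be sketched like this (the class name is an assumption for illustration):

```python
from collections import deque

# Sketch: a sliding window kept as a time-ordered buffer of event
# timestamps; expired events are evicted as new ones arrive.
class SlidingWindow:
    def __init__(self, size_seconds=300):
        self.size = size_seconds
        self.buffer = deque()  # timestamps in ascending order

    def add(self, timestamp):
        self.buffer.append(timestamp)
        # Drop events that have fallen out of the window
        while self.buffer[0] <= timestamp - self.size:
            self.buffer.popleft()
        return len(self.buffer)  # events currently inside the window
```

正文中的两个事件 10:03:39(第 36219 秒)和 10:08:12(第 36492 秒)相隔不到 5 分钟,因此会同时出现在窗口中。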

会话窗口
Session window

与其他窗口类型不同,会话窗口没有固定的持续时间。相反,它是通过将同一用户的所有在时间上紧密发生的事件分组在一起来定义的,并且当用户已经不活动一段时间(例如,如果 30 分钟没有事件)时窗口结束。会话化是网站分析的常见要求(请参阅 “GROUP BY”)。

Unlike the other window types, a session window has no fixed duration. Instead, it is defined by grouping together all events for the same user that occur closely together in time, and the window ends when the user has been inactive for some time (for example, if there have been no events for 30 minutes). Sessionization is a common requirement for website analytics (see “GROUP BY”).
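作为示意,对单个用户按时间排序的事件做会话划分大致如下(阈值与名称均为本示例假设):

As a sketch, sessionizing one user's time-sorted events might look like this (the threshold and names are assumptions for illustration):

```python
# Sketch: sessionization of one user's events; a new session starts after
# 30 minutes of inactivity.
def sessionize(timestamps, inactivity_gap=30 * 60):
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= inactivity_gap:
            sessions[-1].append(t)   # close enough in time: same session
        else:
            sessions.append([t])     # gap exceeded: start a new session
    return sessions
```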

流连接

Stream Joins

第 10 章中,我们讨论了批处理作业如何通过键连接数据集,以及这种连接如何构成数据管道的重要组成部分。由于流处理将数据管道推广到无限数据集的增量处理,因此流上的连接也有完全相同的需求。

In Chapter 10 we discussed how batch jobs can join datasets by key, and how such joins form an important part of data pipelines. Since stream processing generalizes data pipelines to incremental processing of unbounded datasets, there is exactly the same need for joins on streams.

然而,新事件可能随时出现在流上,这一事实使得流上的连接比批处理作业更具挑战性。为了更好地理解这种情况,让我们区分三种不同类型的连接:流-流连接、流-表连接和表-表连接[ 84 ]。在下面的部分中,我们将通过示例来说明每个内容。

However, the fact that new events can appear anytime on a stream makes joins on streams more challenging than in batch jobs. To understand the situation better, let’s distinguish three different types of joins: stream-stream joins, stream-table joins, and table-table joins [84]. In the following sections we’ll illustrate each by example.

流-流连接(窗口连接)

Stream-stream join (window join)

假设您的网站有搜索功能,并且您希望检测被搜索 URL 的最新趋势。每次有人键入搜索查询时,您都会记录一个包含该查询和返回结果的事件。每次有人点击其中一个搜索结果时,您都会记录另一个记录该点击的事件。为了计算搜索结果中每个 URL 的点击率,您需要将搜索操作和点击操作的事件汇集在一起,这些事件通过具有相同的会话 ID 而关联。广告系统也需要类似的分析 [85]。

Say you have a search feature on your website, and you want to detect recent trends in searched-for URLs. Every time someone types a search query, you log an event containing the query and the results returned. Every time someone clicks one of the search results, you log another event recording the click. In order to calculate the click-through rate for each URL in the search results, you need to bring together the events for the search action and the click action, which are connected by having the same session ID. Similar analyses are needed in advertising systems [85].

如果用户放弃搜索,点击可能永远不会到来;即使它到来了,搜索和点击之间的时间间隔也可能变化很大:在许多情况下可能只有几秒钟,但也可能长达数天甚至数周(如果用户运行搜索后忘记了那个浏览器标签页,之后又回到该标签页并点击了某个结果)。由于网络延迟的变化,点击事件甚至可能先于搜索事件到达。您可以为连接选择一个合适的窗口。例如,如果点击和搜索相隔至多一小时,您可以选择将它们连接起来。

The click may never come if the user abandons their search, and even if it comes, the time between the search and the click may be highly variable: in many cases it might be a few seconds, but it could be as long as days or weeks (if a user runs a search, forgets about that browser tab, and then returns to the tab and clicks a result sometime later). Due to variable network delays, the click event may even arrive before the search event. You can choose a suitable window for the join—for example, you may choose to join a click with a search if they occur at most one hour apart.

请注意,在点击事件中嵌入搜索的详细信息并不等同于连接这两种事件:这样做只能告诉您用户点击了搜索结果的情况,而无法告诉您那些用户没有点击任何结果的搜索。为了衡量搜索质量,您需要准确的点击率,为此您同时需要搜索事件和点击事件。

Note that embedding the details of the search in the click event is not equivalent to joining the events: doing so would only tell you about the cases where the user clicked a search result, not about the searches where the user did not click any of the results. In order to measure search quality, you need accurate click-through rates, for which you need both the search events and the click events.

为了实现这种类型的连接,流处理器需要维护状态:例如,过去一小时内发生的所有事件,按会话 ID 进行索引。每当搜索事件或点击事件发生时,它都会被添加到适当的索引中,并且流处理器还会检查另一个索引以查看同一会话 ID 的另一个事件是否已经到达。如果存在匹配的事件,您将发出一个事件,说明单击了哪个搜索结果。如果搜索事件过期而您没有看到匹配的点击事件,您将发出一个事件,说明哪些搜索结果未被点击。

To implement this type of join, a stream processor needs to maintain state: for example, all the events that occurred in the last hour, indexed by session ID. Whenever a search event or click event occurs, it is added to the appropriate index, and the stream processor also checks the other index to see if another event for the same session ID has already arrived. If there is a matching event, you emit an event saying which search result was clicked. If the search event expires without you seeing a matching click event, you emit an event saying which search results were not clicked.
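正文描述的“双索引互探”结构可以用一个极简草图表达(为简洁起见省略了窗口过期逻辑;名称均为本示例假设):

The two-index probing structure described above can be expressed as a minimal sketch (window expiry is omitted for brevity; names are assumptions for illustration):

```python
# Sketch of a stream-stream join on session ID: each input keeps an index
# of events seen so far and probes the other input's index on arrival.
searches = {}  # session_id -> search query
clicks = {}    # session_id -> clicked URL
joined = []    # emitted (session_id, query, url) join results

def on_search(session_id, query):
    searches[session_id] = query
    if session_id in clicks:        # the click may have arrived first
        joined.append((session_id, query, clicks[session_id]))

def on_click(session_id, url):
    clicks[session_id] = url
    if session_id in searches:
        joined.append((session_id, searches[session_id], url))
```

注意无论两类事件以何种顺序到达,匹配都会被发出,这正是正文强调“点击事件甚至可能先于搜索事件到达”时所需要的行为。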

流表连接(流丰富)

Stream-table join (stream enrichment)

“示例:用户活动事件分析”图 10-2)中,我们看到了一个连接两个数据集的批处理作业示例:一组用户活动事件和一个用户配置文件数据库。很自然地将用户活动事件视为流,并在流处理器中连续执行相同的连接:输入是包含用户 ID 的活动事件流,输出是活动流用户 ID 已使用有关用户的配置文件信息进行扩充的事件。此过程有时称为使用数据库中的信息丰富活动事件。

In “Example: analysis of user activity events” (Figure 10-2) we saw an example of a batch job joining two datasets: a set of user activity events and a database of user profiles. It is natural to think of the user activity events as a stream, and to perform the same join on a continuous basis in a stream processor: the input is a stream of activity events containing a user ID, and the output is a stream of activity events in which the user ID has been augmented with profile information about the user. This process is sometimes known as enriching the activity events with information from the database.

要执行此连接,流处理过程需要一次查看一个活动事件,在数据库中查找该事件的用户 ID,并将档案信息添加到该活动事件中。数据库查找可以通过查询远程数据库来实现;然而,正如“示例:用户活动事件分析”中所讨论的,此类远程查询可能很慢,并且有压垮数据库的风险 [75]。

To perform this join, the stream process needs to look at one activity event at a time, look up the event’s user ID in the database, and add the profile information to the activity event. The database lookup could be implemented by querying a remote database; however, as discussed in “Example: analysis of user activity events”, such remote queries are likely to be slow and risk overloading the database [75].

另一种方法是将数据库的副本加载到流处理器中,以便可以在本地查询它,而无需网络往返。这种技术与我们在“Map-Side Joins”中讨论的哈希联接非常相似:数据库的本地副本可能是内存中的哈希表(如果足够小),也可能是本地磁盘上的索引。

Another approach is to load a copy of the database into the stream processor so that it can be queried locally without a network round-trip. This technique is very similar to the hash joins we discussed in “Map-Side Joins”: the local copy of the database might be an in-memory hash table if it is small enough, or an index on the local disk.

与批处理作业的区别在于,批处理作业使用数据库的时间点快照作为输入,而流处理器是长时间运行的,并且数据库的内容可能会随着时间的推移而变化,因此流处理器的数据库的本地副本需要保持最新。这个问题可以通过变更数据捕获来解决:流处理器可以订阅用户配置文件数据库的变更日志以及活动事件流。创建或修改配置文件时,流处理器会更新其本地副本。因此,我们获得了两个流之间的连接:活动事件和配置文件更新。

The difference to batch jobs is that a batch job uses a point-in-time snapshot of the database as input, whereas a stream processor is long-running, and the contents of the database are likely to change over time, so the stream processor’s local copy of the database needs to be kept up to date. This issue can be solved by change data capture: the stream processor can subscribe to a changelog of the user profile database as well as the stream of activity events. When a profile is created or modified, the stream processor updates its local copy. Thus, we obtain a join between two streams: the activity events and the profile updates.
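下面是一个极简草图,演示以变更数据捕获保持本地副本新鲜的流表连接(名称与事件结构均为本示例假设):

Here is a minimal sketch of a stream-table join whose local replica is kept fresh by change data capture (names and event shapes are assumptions for illustration):

```python
# Sketch: stream-table join (enrichment) against a local hash-table replica
# of the profile database, updated by consuming the database's changelog.
profiles = {}  # local replica: user_id -> profile record

def on_profile_change(user_id, profile):
    # Apply a change data capture event from the profile database
    profiles[user_id] = profile

def enrich(activity_event):
    # Local lookup only: no network round-trip per activity event
    return {**activity_event,
            "profile": profiles.get(activity_event["user_id"])}
```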

流-表连接实际上与流-流连接非常相似;最大的区别是,对于表变更日志流,连接使用一个可以追溯到“时间开始”的窗口(概念上无限的窗口),新版本的记录会覆盖旧版本。对于流输入,连接可能根本不维护窗口。

A stream-table join is actually very similar to a stream-stream join; the biggest difference is that for the table changelog stream, the join uses a window that reaches back to the “beginning of time” (a conceptually infinite window), with newer versions of records overwriting older ones. For the stream input, the join might not maintain a window at all.

表-表连接(物化视图维护)

Table-table join (materialized view maintenance)

考虑我们在“描述负载” 中讨论的 Twitter 时间线示例。我们说过,当用户想要查看他们的主页时间线时,迭代该用户关注的所有人员、查找他们最近的推文并合并它们的成本太高。

Consider the Twitter timeline example that we discussed in “Describing Load”. We said that when a user wants to view their home timeline, it is too expensive to iterate over all the people the user is following, find their recent tweets, and merge them.

相反,我们需要一个时间线缓存:一种每用户的“收件箱”,推文在发送时被写入其中,这样读取时间线就只是一次查找。物化和维护此缓存需要以下事件处理:

Instead, we want a timeline cache: a kind of per-user “inbox” to which tweets are written as they are sent, so that reading the timeline is a single lookup. Materializing and maintaining this cache requires the following event processing:

  • 当用户u发送一条新推文时,它会被添加到关注u的每个用户的时间线中。

  • When user u sends a new tweet, it is added to the timeline of every user who is following u.

  • 当用户删除一条推文时,该推文就会从所有用户的时间轴中删除。

  • When a user deletes a tweet, it is removed from all users’ timelines.

  • 当用户 u1 开始关注用户 u2 时,u2 最近的推文将被添加到 u1 的时间线中。

  • When user u1 starts following user u2, recent tweets by u2 are added to u1’s timeline.

  • 当用户 u1 取消关注用户 u2 时,u2 的推文将从 u1 的时间线中删除。

  • When user u1 unfollows user u2, tweets by u2 are removed from u1’s timeline.

要在流处理器中实现此缓存维护,您需要推文(发送和删除)和关注关系(关注和取消关注)的事件流。流进程需要维护一个包含每个用户的关注者集的数据库,以便它知道当新推文到达时需要更新哪些时间线[ 86 ]。

To implement this cache maintenance in a stream processor, you need streams of events for tweets (sending and deleting) and for follow relationships (following and unfollowing). The stream process needs to maintain a database containing the set of followers for each user so that it knows which timelines need to be updated when a new tweet arrives [86].
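这种缓存维护可以用一个极简草图表达(为简洁起见省略了删除推文和取消关注;名称均为本示例假设):

This cache maintenance can be expressed as a minimal sketch (tweet deletion and unfollowing are omitted for brevity; names are assumptions for illustration):

```python
# Sketch: maintaining per-user timeline caches from tweet and follow
# event streams.
followers = {}  # followee_id -> set of follower ids
timelines = {}  # user_id -> list of tweets (the materialized "inbox")

def on_follow(follower_id, followee_id):
    followers.setdefault(followee_id, set()).add(follower_id)

def on_tweet(sender_id, tweet):
    # Fan the new tweet out to every follower's cached timeline
    for user_id in followers.get(sender_id, set()):
        timelines.setdefault(user_id, []).append(tweet)
```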

查看此流过程的另一种方式是,它为连接两个表(推文和关注)的查询维护一个物化视图,如下所示:

Another way of looking at this stream process is that it maintains a materialized view for a query that joins two tables (tweets and follows), something like the following:

SELECT follows.follower_id AS timeline_id,
  array_agg(tweets.* ORDER BY tweets.timestamp DESC)
FROM tweets
JOIN follows ON follows.followee_id = tweets.sender_id
GROUP BY follows.follower_id

流的连接直接对应于该查询中表的连接。时间线实际上是该查询结果的缓存,每当基础表发生变化时都会更新。iii

The join of the streams corresponds directly to the join of the tables in that query. The timelines are effectively a cache of the result of this query, updated every time the underlying tables change.iii

连接的时间依赖性

Time-dependence of joins

这里描述的三种类型的连接(流-流、流-表和表-表)有很多共同点:它们都需要流处理器基于连接的一个输入来维护某种状态(搜索和点击事件、用户档案或关注者列表),并针对来自连接另一个输入的消息查询该状态。

The three types of joins described here (stream-stream, stream-table, and table-table) have a lot in common: they all require the stream processor to maintain some state (search and click events, user profiles, or follower list) based on one join input, and query that state on messages from the other join input.

维护状态的事件的顺序很重要(先关注再取消关注,与先取消关注再关注,结果是不一样的)。在分区日志中,单个分区内的事件顺序会被保留,但通常无法保证跨不同流或分区的顺序。

The order of the events that maintain the state is important (it matters whether you first follow and then unfollow, or the other way round). In a partitioned log, the ordering of events within a single partition is preserved, but there is typically no ordering guarantee across different streams or partitions.

这就提出了一个问题:如果不同流上的事件在相近的时间发生,它们会以什么顺序被处理?在流表连接的示例中,如果用户更新了其档案,哪些活动事件会与旧档案连接(在档案更新之前处理),哪些会与新档案连接(在档案更新之后处理)?换句话说:如果状态随时间变化,并且您要与某个状态进行连接,那么连接应该使用哪个时间点的状态 [45]?

This raises a question: if events on different streams happen around a similar time, in which order are they processed? In the stream-table join example, if a user updates their profile, which activity events are joined with the old profile (processed before the profile update), and which are joined with the new profile (processed after the profile update)? Put another way: if state changes over time, and you join with some state, what point in time do you use for the join [45]?

这种时间依赖性可能发生在许多地方。例如,如果您销售商品,则需要对发票应用正确的税率,这取决于国家或州、产品类型以及销售日期(因为税率会不时变化)。将销售额连接到税率表时,您可能希望连接销售时的税率,如果您正在重新处理历史数据,则该税率可能与当前税率不同。

Such time dependence can occur in many places. For example, if you sell things, you need to apply the right tax rate to invoices, which depends on the country or state, the type of product, and the date of sale (since tax rates change from time to time). When joining sales to a table of tax rates, you probably want to join with the tax rate at the time of the sale, which may be different from the current tax rate if you are reprocessing historical data.

如果跨流的事件顺序是不确定的,那么连接就会变成非确定性的 [87],这意味着您无法在相同的输入上重新运行相同的作业并必然得到相同的结果:当您再次运行该作业时,输入流上的事件可能会以不同的方式交错。

If the ordering of events across streams is undetermined, the join becomes nondeterministic [87], which means you cannot rerun the same job on the same input and necessarily get the same result: the events on the input streams may be interleaved in a different way when you run the job again.

在数据仓库中,这个问题被称为缓慢变化维度(SCD),通常通过对连接记录的特定版本使用唯一标识符来解决:例如,每次税率发生变化时,都会给出一个新标识符,发票包含销售时税率的标识符 [ 88 , 89 ]。此更改使连接具有确定性,但会导致无法进行日志压缩,因为需要保留表中记录的所有版本。

In data warehouses, this issue is known as a slowly changing dimension (SCD), and it is often addressed by using a unique identifier for a particular version of the joined record: for example, every time the tax rate changes, it is given a new identifier, and the invoice includes the identifier for the tax rate at the time of sale [88, 89]. This change makes the join deterministic, but has the consequence that log compaction is not possible, since all versions of the records in the table need to be retained.

容错能力

Fault Tolerance

在本章的最后一节,我们来考虑流处理器如何容忍错误。我们在第 10 章中看到,批处理框架可以相当容易地容忍错误:如果 MapReduce 作业中的任务失败,只需在另一台机器上重新启动它即可,并且失败任务的输出将被丢弃。这种透明的重试是可能的,因为输入文件是不可变的,每个任务将其输出写入 HDFS 上的单独文件,并且只有在任务成功完成时输出才可见。

In the final section of this chapter, let’s consider how stream processors can tolerate faults. We saw in Chapter 10 that batch processing frameworks can tolerate faults fairly easily: if a task in a MapReduce job fails, it can simply be started again on another machine, and the output of the failed task is discarded. This transparent retry is possible because input files are immutable, each task writes its output to a separate file on HDFS, and output is only made visible when a task completes successfully.

特别是,批处理容错方法可确保批处理作业的输出与没有发生任何错误一样,即使实际上某些任务确实失败了。看起来好像每个输入记录都被处理了一次——没有记录被跳过,也没有记录被处理两次。尽管重新启动任务意味着记录实际上可能会被处理多次,但输出中的可见效果就像它们只被处理过一次一样。这一原则被称为 精确一次语义,尽管有效一次将是一个更具描述性的术语[ 90 ]。

In particular, the batch approach to fault tolerance ensures that the output of the batch job is the same as if nothing had gone wrong, even if in fact some tasks did fail. It appears as though every input record was processed exactly once—no records are skipped, and none are processed twice. Although restarting tasks means that records may in fact be processed multiple times, the visible effect in the output is as if they had only been processed once. This principle is known as exactly-once semantics, although effectively-once would be a more descriptive term [90].

流处理中也会出现同样的容错问题,但处理起来不太简单:等待任务完成后再使其输出可见不是一个选择,因为流是无限的,因此您永远无法完成处理它。

The same issue of fault tolerance arises in stream processing, but it is less straightforward to handle: waiting until a task is finished before making its output visible is not an option, because a stream is infinite and so you can never finish processing it.

微批处理和检查点

Microbatching and checkpointing

一种解决方案是将流分成小块,并将每个块视为微型批处理过程。这种方法称为微批处理,用于 Spark Streaming [ 91 ]。批处理大小通常约为一秒,这是性能妥协的结果:较小的批处理会产生更大的调度和协调开销,而较大的批处理意味着流处理器的结果可见之前的延迟较长。

One solution is to break the stream into small blocks, and treat each block like a miniature batch process. This approach is called microbatching, and it is used in Spark Streaming [91]. The batch size is typically around one second, which is the result of a performance compromise: smaller batches incur greater scheduling and coordination overhead, while larger batches mean a longer delay before results of the stream processor become visible.

微批处理还隐式提供了一个等于批处理大小的滚动窗口(按处理时间而不是事件时间戳来窗口化);任何需要更大窗口的作业都需要显式地将状态从一个微批次转移到下一个微批次。

Microbatching also implicitly provides a tumbling window equal to the batch size (windowed by processing time, not event timestamps); any jobs that require larger windows need to explicitly carry over state from one microbatch to the next.

Apache Flink 中使用的一种变体方法是定期生成状态滚动检查点并将其写入持久存储 [ 92 , 93 ]。如果流运算符崩溃,它可以从最近的检查点重新启动,并丢弃上一个检查点和崩溃之间生成的任何输出。检查点由消息流中的屏障触发,类似于微批次之间的边界,但不强制特定的窗口大小。

A variant approach, used in Apache Flink, is to periodically generate rolling checkpoints of state and write them to durable storage [92, 93]. If a stream operator crashes, it can restart from its most recent checkpoint and discard any output generated between the last checkpoint and the crash. The checkpoints are triggered by barriers in the message stream, similar to the boundaries between microbatches, but without forcing a particular window size.

在流处理框架的范围内,微批处理和检查点方法提供与批处理相同的一次性语义。但是,一旦输出离开流处理器(例如,通过写入数据库、向外部消息代理发送消息或发送电子邮件),框架就无法再丢弃失败批次的输出。在这种情况下,重新启动失败的任务会导致外部副作用发生两次,而单独的微批处理或检查点不足以防止此问题。

Within the confines of the stream processing framework, the microbatching and checkpointing approaches provide the same exactly-once semantics as batch processing. However, as soon as output leaves the stream processor (for example, by writing to a database, sending messages to an external message broker, or sending emails), the framework is no longer able to discard the output of a failed batch. In this case, restarting a failed task causes the external side effect to happen twice, and microbatching or checkpointing alone is not sufficient to prevent this problem.

重新审视原子提交

Atomic commit revisited

为了在出现故障时呈现出恰好处理一次的效果,我们需要确保:当且仅当处理成功时,处理一个事件的所有输出和副作用才生效。这些影响包括发送给下游算子或外部消息系统的任何消息(包括电子邮件或推送通知)、任何数据库写入、对算子状态的任何更改,以及对输入消息的任何确认(包括在基于日志的消息代理中向前移动消费者偏移量)。

In order to give the appearance of exactly-once processing in the presence of faults, we need to ensure that all outputs and side effects of processing an event take effect if and only if the processing is successful. Those effects include any messages sent to downstream operators or external messaging systems (including email or push notifications), any database writes, any changes to operator state, and any acknowledgment of input messages (including moving the consumer offset forward in a log-based message broker).

这些事情要么需要全部原子地发生,要么全都不发生,但它们不应该彼此失去同步。如果这种做法听起来很熟悉,那是因为我们在“恰好一次的消息处理”中已经在分布式事务和两阶段提交的上下文中讨论过它。

Those things either all need to happen atomically, or none of them must happen, but they should not go out of sync with each other. If this approach sounds familiar, it is because we discussed it in “Exactly-once message processing” in the context of distributed transactions and two-phase commit.

第9章中我们讨论了分布式事务的传统实现中的问题,例如XA。然而,在更受限制的环境中,可以有效地实现这样的原子提交设施。这种方法用于 Google Cloud Dataflow [ 81 , 92 ] 和 VoltDB [ 94 ],并且计划向 Apache Kafka [ 95 , 96]添加类似的功能]。与 XA 不同,这些实现并不尝试跨异构技术提供事务,而是通过在流处理框架内管理状态更改和消息传递来将它们保留在内部。事务协议的开销可以通过在单个事务中处理多个输入消息来摊销。

In Chapter 9 we discussed the problems in the traditional implementations of distributed transactions, such as XA. However, in more restricted environments it is possible to implement such an atomic commit facility efficiently. This approach is used in Google Cloud Dataflow [81, 92] and VoltDB [94], and there are plans to add similar features to Apache Kafka [95, 96]. Unlike XA, these implementations do not attempt to provide transactions across heterogeneous technologies, but instead keep them internal by managing both state changes and messaging within the stream processing framework. The overhead of the transaction protocol can be amortized by processing several input messages within a single transaction.

幂等性

Idempotence

我们的目标是丢弃任何失败任务的部分输出,以便可以安全地重试它们,而不会两次生效。分布式事务是实现该目标的一种方法,但另一种方法是依赖幂等性[ 97 ]。

Our goal is to discard the partial output of any failed tasks so that they can be safely retried without taking effect twice. Distributed transactions are one way of achieving that goal, but another way is to rely on idempotence [97].

幂等操作是指可以执行多次、且效果与只执行一次相同的操作。例如,将键值存储中的某个键设置为某个固定值是幂等的(再次写入只是用相同的值覆盖原值),而递增计数器则不是幂等的(再次执行递增意味着该值被递增了两次)。

An idempotent operation is one that you can perform multiple times, and it has the same effect as if you performed it only once. For example, setting a key in a key-value store to some fixed value is idempotent (writing the value again simply overwrites the value with an identical value), whereas incrementing a counter is not idempotent (performing the increment again means the value is incremented twice).

即使操作本身不是幂等的,通常也可以通过一些额外的元数据使其具有幂等性。例如,当使用来自 Kafka 的消息时,每条消息都有一个持久的、单调递增的偏移量。将值写入外部数据库时,您可以包含触发上次写入该值的消息的偏移量。因此,您可以判断是否已应用更新,并避免再次执行相同的更新。

Even if an operation is not naturally idempotent, it can often be made idempotent with a bit of extra metadata. For example, when consuming messages from Kafka, every message has a persistent, monotonically increasing offset. When writing a value to an external database, you can include the offset of the message that triggered the last write with the value. Thus, you can tell whether an update has already been applied, and avoid performing the same update again.
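正文描述的“随值存储偏移量”技巧可以这样示意(存储结构与名称均为本示例假设):

The store-the-offset-with-the-value trick described above can be sketched like this (the storage layout and names are assumptions for illustration):

```python
# Sketch: making a database update idempotent by recording, with each
# value, the offset of the message that last wrote it; replays of
# already-applied messages are skipped.
store = {}  # key -> (value, offset_of_last_applied_message)

def apply_update(key, value, offset):
    _, last_offset = store.get(key, (None, -1))
    if offset <= last_offset:
        return False  # already applied: this is a replay after recovery
    store[key] = (value, offset)
    return True
```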

Storm 的 Trident 中的状态处理基于类似的想法 [78]。依赖幂等性意味着几个假设:重新启动失败的任务必须以相同的顺序重放相同的消息(基于日志的消息代理可以做到这一点)、处理必须是确定性的,并且没有其他节点可以并发更新同一个值 [98, 99]。

The state handling in Storm’s Trident is based on a similar idea [78]. Relying on idempotence implies several assumptions: restarting a failed task must replay the same messages in the same order (a log-based message broker does this), the processing must be deterministic, and no other node may concurrently update the same value [98, 99].

当从一个处理节点故障转移到另一个处理节点时,可能需要隔离(请参阅 “领导者和锁”),以防止来自被认为已死亡但实际上还活着的节点的干扰。尽管存在所有这些警告,幂等操作仍然是一种只需很小的开销即可实现一次语义的有效方法。

When failing over from one processing node to another, fencing may be required (see “The leader and the lock”) to prevent interference from a node that is thought to be dead but is actually alive. Despite all those caveats, idempotent operations can be an effective way of achieving exactly-once semantics with only a small overhead.

Rebuilding state after a failure

Any stream process that requires state—for example, any windowed aggregations (such as counters, averages, and histograms) and any tables and indexes used for joins—must ensure that this state can be recovered after a failure.

One option is to keep the state in a remote datastore and replicate it, although having to query a remote database for each individual message can be slow, as discussed in “Stream-table join (stream enrichment)”. An alternative is to keep state local to the stream processor, and replicate it periodically. Then, when the stream processor is recovering from a failure, the new task can read the replicated state and resume processing without data loss.

For example, Flink periodically captures snapshots of operator state and writes them to durable storage such as HDFS [92, 93]; Samza and Kafka Streams replicate state changes by sending them to a dedicated Kafka topic with log compaction, similar to change data capture [84, 100]. VoltDB replicates state by redundantly processing each input message on several nodes (see “Actual Serial Execution”).

In some cases, it may not even be necessary to replicate the state, because it can be rebuilt from the input streams. For example, if the state consists of aggregations over a fairly short window, it may be fast enough to simply replay the input events corresponding to that window. If the state is a local replica of a database, maintained by change data capture, the database can also be rebuilt from the log-compacted change stream (see “Log compaction”).
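As a toy illustration of the log-compaction case (an in-memory sketch, assuming a simple key-value changelog):

```python
# Sketch of rebuilding a local table from a log-compacted changelog:
# compaction keeps only the latest entry per key, so replaying the
# compacted log from the beginning reconstructs the same table as
# replaying the full log. In-memory, illustrative only.

def compact(changelog):
    """Keep only the most recent entry per key, preserving log order."""
    latest = {}
    for key, value in changelog:
        latest[key] = value  # later entries overwrite earlier ones
    return list(latest.items())

def rebuild(changelog):
    """Replay a changelog (compacted or not) into a key-value table."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table

changelog = [("a", 1), ("b", 2), ("a", 3)]
table = rebuild(compact(changelog))
# table == {"a": 3, "b": 2} — identical to rebuild(changelog)
```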

However, all of these trade-offs depend on the performance characteristics of the underlying infrastructure: in some systems, network delay may be lower than disk access latency, and network bandwidth may be comparable to disk bandwidth. There is no universally ideal trade-off for all situations, and the merits of local versus remote state may also shift as storage and networking technologies evolve.

Summary

In this chapter we have discussed event streams, what purposes they serve, and how to process them. In some ways, stream processing is very much like the batch processing we discussed in Chapter 10, but done continuously on unbounded (never-ending) streams rather than on a fixed-size input. From this perspective, message brokers and event logs serve as the streaming equivalent of a filesystem.

We spent some time comparing two types of message brokers:

AMQP/JMS-style message broker

The broker assigns individual messages to consumers, and consumers acknowledge individual messages when they have been successfully processed. Messages are deleted from the broker once they have been acknowledged. This approach is appropriate as an asynchronous form of RPC (see also “Message-Passing Dataflow”), for example in a task queue, where the exact order of message processing is not important and where there is no need to go back and read old messages again after they have been processed.

Log-based message broker

The broker assigns all messages in a partition to the same consumer node, and always delivers messages in the same order. Parallelism is achieved through partitioning, and consumers track their progress by checkpointing the offset of the last message they have processed. The broker retains messages on disk, so it is possible to jump back and reread old messages if necessary.
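A toy model of this arrangement, with an append-only list per partition and a checkpointed consumer offset (purely illustrative):

```python
# Toy model of a log-based broker: an append-only list per partition, and
# a consumer that checkpoints the offset of the last message it processed.
# Because the broker retains messages, the consumer can also seek back to
# an earlier offset and reread old messages. Illustrative only.

class PartitionLog:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # offset of the new message

    def read_from(self, offset):
        """Return (offset, message) pairs starting at the given offset."""
        return list(enumerate(self.messages))[offset:]

log = PartitionLog()
for m in ["m0", "m1", "m2"]:
    log.append(m)

checkpoint = 0
processed = []
for offset, msg in log.read_from(checkpoint):
    processed.append(msg)
    checkpoint = offset + 1  # checkpoint after processing each message

# processed == ["m0", "m1", "m2"], checkpoint == 3
# Rereading old messages is just reading from an earlier offset:
replay = [msg for _, msg in log.read_from(1)]
# replay == ["m1", "m2"]
```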

The log-based approach has similarities to the replication logs found in databases (see Chapter 5) and log-structured storage engines (see Chapter 3). We saw that this approach is especially appropriate for stream processing systems that consume input streams and generate derived state or derived output streams.

In terms of where streams come from, we discussed several possibilities: user activity events, sensors providing periodic readings, and data feeds (e.g., market data in finance) are naturally represented as streams. We saw that it can also be useful to think of the writes to a database as a stream: we can capture the changelog—i.e., the history of all changes made to a database—either implicitly through change data capture or explicitly through event sourcing. Log compaction allows the stream to retain a full copy of the contents of a database.

Representing databases as streams opens up powerful opportunities for integrating systems. You can keep derived data systems such as search indexes, caches, and analytics systems continually up to date by consuming the log of changes and applying them to the derived system. You can even build fresh views onto existing data by starting from scratch and consuming the log of changes from the beginning all the way to the present.

The facilities for maintaining state as streams and replaying messages are also the basis for the techniques that enable stream joins and fault tolerance in various stream processing frameworks. We discussed several purposes of stream processing, including searching for event patterns (complex event processing), computing windowed aggregations (stream analytics), and keeping derived data systems up to date (materialized views).

We then discussed the difficulties of reasoning about time in a stream processor, including the distinction between processing time and event timestamps, and the problem of dealing with straggler events that arrive after you thought your window was complete.

We distinguished three types of joins that may appear in stream processes:

Stream-stream joins

Both input streams consist of activity events, and the join operator searches for related events that occur within some window of time. For example, it may match two actions taken by the same user within 30 minutes of each other. The two join inputs may in fact be the same stream (a self-join) if you want to find related events within that one stream.
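A minimal sketch of such a windowed self-join (in-memory, with illustrative field names):

```python
# Sketch of a windowed stream-stream self-join: match pairs of events by
# the same user that occur within 30 minutes of each other. A per-user
# buffer holds recent events; anything older than the window is expired.
# In-memory and illustrative only.

from collections import defaultdict

WINDOW = 30 * 60  # 30 minutes, in seconds

def stream_stream_join(events):
    """events: iterable of (timestamp, user_id, action); yields matched pairs."""
    buffers = defaultdict(list)  # user_id -> [(timestamp, action), ...]
    for ts, user, action in events:
        buf = buffers[user]
        # drop buffered events that have fallen out of the join window
        buf[:] = [(t, a) for t, a in buf if ts - t <= WINDOW]
        for t, a in buf:
            yield (user, a, action)  # earlier action joined with this one
        buf.append((ts, action))

events = [
    (0, "alice", "search"),
    (600, "alice", "click"),      # 10 min later: joins with "search"
    (2000, "alice", "purchase"),  # >30 min after "search", <30 min after "click"
]
pairs = list(stream_stream_join(events))
# pairs == [("alice", "search", "click"), ("alice", "click", "purchase")]
```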

Stream-table joins

One input stream consists of activity events, while the other is a database changelog. The changelog keeps a local copy of the database up to date. For each activity event, the join operator queries the database and outputs an enriched activity event.
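A simplified sketch (the changelog is applied before the activity events here; a real processor interleaves both streams by time):

```python
# Sketch of a stream-table join (stream enrichment): a changelog stream
# keeps a local copy of a user table up to date, and each activity event
# is enriched from that local replica. Field names are illustrative.

def stream_table_join(changelog, activity):
    """changelog: (user_id, profile) updates; activity: (user_id, action) events."""
    table = {}
    for user, profile in changelog:      # apply changelog to the local replica
        table[user] = profile
    for user, action in activity:        # enrich each event from the replica
        yield {"user": user, "action": action, "profile": table.get(user)}

changelog = [("u1", {"city": "Berlin"}), ("u1", {"city": "London"})]  # later change wins
activity = [("u1", "click"), ("u2", "view")]
enriched = list(stream_table_join(changelog, activity))
# enriched[0]["profile"] == {"city": "London"}; u2 has no profile, so None
```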

Table-table joins

Both input streams are database changelogs. In this case, every change on one side is joined with the latest state of the other side. The result is a stream of changes to the materialized view of the join between the two tables.
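A minimal sketch of maintaining such a materialized join incrementally (illustrative only):

```python
# Sketch of an incrementally maintained table-table join: both inputs are
# changelogs; each change on one side is joined with the latest state of
# the other side, emitting a stream of updates to the materialized view.
# Illustrative, in-memory only.

def table_table_join(changes):
    """changes: iterable of (side, key, row), where side is 'left' or 'right'.
    Yields (key, left_row, right_row) updates to the materialized join."""
    left, right = {}, {}
    for side, key, row in changes:
        if side == "left":
            left[key] = row
            if key in right:                 # join with latest right-side state
                yield (key, row, right[key])
        else:
            right[key] = row
            if key in left:                  # join with latest left-side state
                yield (key, left[key], row)

changes = [
    ("left", "k1", "L1"),
    ("right", "k1", "R1"),   # emits (k1, L1, R1)
    ("left", "k1", "L2"),    # emits (k1, L2, R1) — the view is updated
]
view_updates = list(table_table_join(changes))
# view_updates == [("k1", "L1", "R1"), ("k1", "L2", "R1")]
```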

Finally, we discussed techniques for achieving fault tolerance and exactly-once semantics in a stream processor. As with batch processing, we need to discard the partial output of any failed tasks. However, since a stream process is long-running and produces output continuously, we can’t simply discard all output. Instead, a finer-grained recovery mechanism can be used, based on microbatching, checkpointing, transactions, or idempotent writes.

Footnotes

i It’s possible to create a load balancing scheme in which two consumers share the work of processing a partition by having both read the full set of messages, but one of them only considers messages with even-numbered offsets while the other deals with the odd-numbered offsets. Alternatively, you could spread message processing over a thread pool, but that approach complicates consumer offset management. In general, single-threaded processing of a partition is preferable, and parallelism can be increased by using more partitions.

ii Thank you to Kostas Kloudas from the Flink community for coming up with this analogy.

iii If you regard a stream as the derivative of a table, as in Figure 11-6, and regard a join as a product of two tables u·v, something interesting happens: the stream of changes to the materialized join follows the product rule (u·v)′ = u′v + uv′. In words: any change of tweets is joined with the current followers, and any change of followers is joined with the current tweets [49, 50].

References

[1] Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing,” Proceedings of the VLDB Endowment, volume 8, number 12, pages 1792–1803, August 2015. doi:10.14778/2824032.2824076

[2] Harold Abelson, Gerald Jay Sussman, and Julie Sussman: Structure and Interpretation of Computer Programs, 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, available online at mitpress.mit.edu

[3] Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “The Many Faces of Publish/Subscribe,” ACM Computing Surveys, volume 35, number 2, pages 114–131, June 2003. doi:10.1145/857076.857078

[4] Joseph M. Hellerstein and Michael Stonebraker: Readings in Database Systems, 4th edition. MIT Press, 2005. ISBN: 978-0-262-69314-1, available online at redbook.cs.berkeley.edu

[5] Don Carney, Uğur Çetintemel, Mitch Cherniack, et al.: “Monitoring Streams – A New Class of Data Management Applications,” at 28th International Conference on Very Large Data Bases (VLDB), August 2002.

[6] Matthew Sackman: “Pushing Back,” lshift.net, May 5, 2016.

[7] Vicent Martí: “Brubeck, a statsd-Compatible Metrics Aggregator,” githubengineering.com, June 15, 2015.

[8] Seth Lowenberger: “MoldUDP64 Protocol Specification V 1.00,” nasdaqtrader.com, July 2009.

[9] Pieter Hintjens: ZeroMQ – The Guide. O’Reilly Media, 2013. ISBN: 978-1-449-33404-8

[10] Ian Malpass: “Measure Anything, Measure Everything,” codeascraft.com, February 15, 2011.

[11] Dieter Plaetinck: “25 Graphite, Grafana and statsd Gotchas,” blog.raintank.io, March 3, 2016.

[12] Jeff Lindsay: “Web Hooks to Revolutionize the Web,” progrium.com, May 3, 2007.

[13] Jim N. Gray: “Queues Are Databases,” Microsoft Research Technical Report MSR-TR-95-56, December 1995.

[14] Mark Hapner, Rich Burridge, Rahul Sharma, et al.: “JSR-343 Java Message Service (JMS) 2.0 Specification,” jms-spec.java.net, March 2013.

[15] Sanjay Aiyagari, Matthew Arrott, Mark Atwell, et al.: “AMQP: Advanced Message Queuing Protocol Specification,” Version 0-9-1, November 2008.

[16] “Google Cloud Pub/Sub: A Google-Scale Messaging Service,” cloud.google.com, 2016.

[17] “Apache Kafka 0.9 Documentation,” kafka.apache.org, November 2015.

[18] Jay Kreps, Neha Narkhede, and Jun Rao: “Kafka: A Distributed Messaging System for Log Processing,” at 6th International Workshop on Networking Meets Databases (NetDB), June 2011.

[19] “Amazon Kinesis Streams Developer Guide,” docs.aws.amazon.com, April 2016.

[20] Leigh Stewart and Sijie Guo: “Building DistributedLog: Twitter’s High-Performance Replicated Log Service,” blog.twitter.com, September 16, 2015.

[21] “DistributedLog Documentation,” Twitter, Inc., distributedlog.io, May 2016.

[22] Jay Kreps: “Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines),” engineering.linkedin.com, April 27, 2014.

[23] Kartik Paramasivam: “How We’re Improving and Advancing Kafka at LinkedIn,” engineering.linkedin.com, September 2, 2015.

[24] Jay Kreps: “The Log: What Every Software Engineer Should Know About Real-Time Data’s Unifying Abstraction,” engineering.linkedin.com, December 16, 2013.

[25] Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at 3rd ACM Symposium on Cloud Computing (SoCC), October 2012.

[26] Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services,” at 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), May 2015.

[27] P. P. S. Narayan: “Sherpa Update,” developer.yahoo.com, June 8.

[28] Martin Kleppmann: “Bottled Water: Real-Time Integration of PostgreSQL and Kafka,” martin.kleppmann.com, April 23, 2015.

[29] Ben Osheroff: “Introducing Maxwell, a mysql-to-kafka Binlog Processor,” developer.zendesk.com, August 20, 2015.

[30] Randall Hauch: “Debezium 0.2.1 Released,” debezium.io, June 10, 2016.

[31] Prem Santosh Udaya Shankar: “Streaming MySQL Tables in Real-Time to Kafka,” engineeringblog.yelp.com, August 1, 2016.

[32] “Mongoriver,” Stripe, Inc., github.com, September 2014.

[33] Dan Harvey: “Change Data Capture with Mongo + Kafka,” at Hadoop Users Group UK, August 2015.

[34] “Oracle GoldenGate 12c: Real-Time Access to Real-Time Information,” Oracle White Paper, March 2015.

[35] “Oracle GoldenGate Fundamentals: How Oracle GoldenGate Works,” Oracle Corporation, youtube.com, November 2012.

[36] Slava Akhmechet: “Advancing the Realtime Web,” rethinkdb.com, January 27, 2015.

[37] “Firebase Realtime Database Documentation,” Google, Inc., firebase.google.com, May 2016.

[38] “Apache CouchDB 1.6 Documentation,” docs.couchdb.org, 2014.

[39] Matt DeBergalis: “Meteor 0.7.0: Scalable Database Queries Using MongoDB Oplog Instead of Poll-and-Diff,” info.meteor.com, December 17, 2013.

[40] “Chapter 15. Importing and Exporting Live Data,” VoltDB 6.4 User Manual, docs.voltdb.com, June 2016.

[41] Neha Narkhede: “Announcing Kafka Connect: Building Large-Scale Low-Latency Data Pipelines,” confluent.io, February 18, 2016.

[42] Greg Young: “CQRS and Event Sourcing,” at Code on the Beach, August 2014.

[43] Martin Fowler: “Event Sourcing,” martinfowler.com, December 12, 2005.

[44] Vaughn Vernon: Implementing Domain-Driven Design. Addison-Wesley Professional, 2013. ISBN: 978-0-321-83457-7

[45] H. V. Jagadish, Inderpal Singh Mumick, and Abraham Silberschatz: “View Maintenance Issues for the Chronicle Data Model,” at 14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), May 1995. doi:10.1145/212433.220201

[46] “Event Store 3.5.0 Documentation,” Event Store LLP, docs.geteventstore.com, February 2016.

[47] Martin Kleppmann: Making Sense of Stream Processing. Report, O’Reilly Media, May 2016.

[48] Sander Mak: “Event-Sourced Architectures with Akka,” at JavaOne, September 2014.

[49] Julian Hyde: personal communication, June 2016.

[50] Ashish Gupta and Inderpal Singh Mumick: Materialized Views: Techniques, Implementations, and Applications. MIT Press, 1999. ISBN: 978-0-262-57122-7

[51] Timothy Griffin and Leonid Libkin: “Incremental Maintenance of Views with Duplicates,” at ACM International Conference on Management of Data (SIGMOD), May 1995. doi:10.1145/223784.223849

[52] Pat Helland: “Immutability Changes Everything,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.

[53] Martin Kleppmann: “Accounting for Computer Scientists,” martin.kleppmann.com, March 7, 2011.

[54] Pat Helland: “Accountants Don’t Use Erasers,” blogs.msdn.com, June 14, 2007.

[55] Fangjin Yang: “Dogfooding with Druid, Samza, and Kafka: Metametrics at Metamarkets,” metamarkets.com, June 3, 2015.

[56] Gavin Li, Jianqiu Lv, and Hang Qi: “Pistachio: Co-Locate the Data and Compute for Fastest Cloud Compute,” yahoohadoop.tumblr.com, April 13, 2015.

[57] Kartik Paramasivam: “Stream Processing Hard Problems – Part 1: Killing Lambda,” engineering.linkedin.com, June 27, 2016.

[58] Martin Fowler: “CQRS,” martinfowler.com, July 14, 2011.

[59] Greg Young: “CQRS Documents,” cqrs.files.wordpress.com, November 2010.

[60] Baron Schwartz: “Immutability, MVCC, and Garbage Collection,” xaprb.com, December 28, 2013.

[61] Daniel Eloff, Slava Akhmechet, Jay Kreps, et al.: “Re: Turning the Database Inside-out with Apache Samza,” Hacker News discussion, news.ycombinator.com, March 4, 2015.

[62] “Datomic Development Resources: Excision,” Cognitect, Inc., docs.datomic.com.

[63] “Fossil Documentation: Deleting Content from Fossil,” fossil-scm.org, 2016.

[64] Jay Kreps: “The irony of distributed systems is that data loss is really easy but deleting data is surprisingly hard,” twitter.com, March 30, 2015.

[65] David C. Luckham: “What’s the Difference Between ESP and CEP?,” complexevents.com, August 1, 2006.

[66] Srinath Perera: “How Is Stream Processing and Complex Event Processing (CEP) Different?,” quora.com, December 3, 2015.

[67] Arvind Arasu, Shivnath Babu, and Jennifer Widom: “The CQL Continuous Query Language: Semantic Foundations and Query Execution,” The VLDB Journal, volume 15, number 2, pages 121–142, June 2006. doi:10.1007/s00778-004-0147-z

[68] Julian Hyde: “Data in Flight: How Streaming SQL Technology Can Help Solve the Web 2.0 Data Crunch,” ACM Queue, volume 7, number 11, December 2009. doi:10.1145/1661785.1667562

[69] “Esper Reference, Version 5.4.0,” EsperTech, Inc., espertech.com, April 2016.

[70] Zubair Nabi, Eric Bouillet, Andrew Bainbridge, and Chris Thomas: “Of Streams and Storms,” IBM technical report, developer.ibm.com, April 2014.

[71] Milinda Pathirage, Julian Hyde, Yi Pan, and Beth Plale: “SamzaSQL: Scalable Fast Data Management with Streaming SQL,” at IEEE International Workshop on High-Performance Big Data Computing (HPBDC), May 2016. doi:10.1109/IPDPSW.2016.141

[72] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier: “HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm,” at Conference on Analysis of Algorithms (AofA), June 2007.

[73] Jay Kreps: “Questioning the Lambda Architecture,” oreilly.com, July 2, 2014.

[74] Ian Hellström: “An Overview of Apache Streaming Technologies,” databaseline.wordpress.com, March 12, 2016.

[75] Jay Kreps: “Why Local State Is a Fundamental Primitive in Stream Processing,” oreilly.com, July 31, 2014.

[76] Shay Banon: “Percolator,” elastic.co, February 8, 2011.

[77] Alan Woodward and Martin Kleppmann: “Real-Time Full-Text Search with Luwak and Samza,” martin.kleppmann.com, April 13, 2015.

[78] “Apache Storm 1.0.1 Documentation,” storm.apache.org, May 2016.

[79] Tyler Akidau: “The World Beyond Batch: Streaming 102,” oreilly.com, January 20, 2016.

[80] Stephan Ewen: “Streaming Analytics with Apache Flink,” at Kafka Summit, April 2016.

[81] Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, et al.: “MillWheel: Fault-Tolerant Stream Processing at Internet Scale,” at 39th International Conference on Very Large Data Bases (VLDB), August 2013.

[82] Alex Dean: “Improving Snowplow’s Understanding of Time,” snowplowanalytics.com, September 15, 2015.

[83] “Windowing (Azure Stream Analytics),” Microsoft Azure Reference, msdn.microsoft.com, April 2016.

[84] “State Management,” Apache Samza 0.10 Documentation, samza.apache.org, December 2015.

[85] Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, et al.: “Photon: Fault-Tolerant and Scalable Joining of Continuous Data Streams,” at ACM International Conference on Management of Data (SIGMOD), June 2013. doi:10.1145/2463676.2465272

[86] Martin Kleppmann: “Samza Newsfeed Demo,” github.com, September 2014.

[87] Ben Kirwin: “Doing the Impossible: Exactly-Once Messaging Patterns in Kafka,” ben.kirw.in, November 28, 2014.

[88] Pat Helland: “Data on the Outside Versus Data on the Inside,” at 2nd Biennial Conference on Innovative Data Systems Research (CIDR), January 2005.

[89] Ralph Kimball and Margy Ross: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd edition. John Wiley & Sons, 2013. ISBN: 978-1-118-53080-1

[90] Viktor Klang: “I’m coining the phrase ‘effectively-once’ for message processing with at-least-once + idempotent operations,” twitter.com, October 20, 2016.

[91] Matei Zaharia, Tathagata Das, Haoyuan Li, et al.: “Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters,” at 4th USENIX Conference in Hot Topics in Cloud Computing (HotCloud), June 2012.

[92] Kostas Tzoumas, Stephan Ewen, and Robert Metzger: “High-Throughput, Low-Latency, and Exactly-Once Stream Processing with Apache Flink,” data-artisans.com, August 5, 2015.

[93] Paris Carbone, Gyula Fóra, Stephan Ewen, et al.: “Lightweight Asynchronous Snapshots for Distributed Dataflows,” arXiv:1506.08603 [cs.DC], June 29, 2015.

[94] Ryan Betts and John Hugg: Fast Data: Smart and at Scale. Report, O’Reilly Media, October 2015.

[95] Flavio Junqueira: “Making Sense of Exactly-Once Semantics,” at Strata+Hadoop World London, June 2016.

[96] Jason Gustafson, Flavio Junqueira, Apurva Mehta, Sriram Subramanian, and Guozhang Wang: “KIP-98 – Exactly Once Delivery and Transactional Messaging,” cwiki.apache.org, November 2016.

[97] Pat Helland: “Idempotence Is Not a Medical Condition,” Communications of the ACM, volume 55, number 5, page 56, May 2012. doi:10.1145/2160718.2160734

[98] Jay Kreps: “Re: Trying to Achieve Deterministic Behavior on Recovery/Rewind,” email to samza-dev mailing list, September 9, 2014.

[99] E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson: “A Survey of Rollback-Recovery Protocols in Message-Passing Systems,” ACM Computing Surveys, volume 34, number 3, pages 375–408, September 2002. doi:10.1145/568522.568525

[ 100 ] Adam Warski:“ Kafka Streams – 它如何适应流处理环境?”,softwaremill.com,2016 年 6 月 1 日。

[100] Adam Warski: “Kafka Streams – How Does It Fit the Stream Processing Landscape?,” softwaremill.com, June 1, 2016.

第 12 章数据系统的未来

Chapter 12. The Future of Data Systems

如果一个事物以另一个事物为其目的,那么它的最终目的就不可能在于保全自身的存在。因此,船长不会把保全托付给他的船当作最终目的,因为船注定以其他事物为目的,即航行。

(通常被引用为:如果船长的最高目标是保护他的船,他就会将其永远留在港口。)

圣托马斯·阿奎那,《神学大全》(1265–1274)

If a thing be ordained to another as to its end, its last end cannot consist in the preservation of its being. Hence a captain does not intend as a last end, the preservation of the ship entrusted to him, since a ship is ordained to something else as its end, viz. to navigation.

(Often quoted as: If the highest aim of a captain were to preserve his ship, he would keep it in port forever.)

St. Thomas Aquinas, Summa Theologica (1265–1274)

到目前为止,本书主要是描述目前的情况。在最后一章中,我们将把我们的视角转向未来,并讨论事情应该如何发展:我将提出一些想法和方法,我相信这些想法和方法可能会从根本上改善我们设计和构建应用程序的方式。

So far, this book has been mostly about describing things as they are at present. In this final chapter, we will shift our perspective toward the future and discuss how things should be: I will propose some ideas and approaches that, I believe, may fundamentally improve the ways we design and build applications.

对未来的看法和猜测当然是主观的,因此在本章中,写到我的个人观点时我会使用第一人称。欢迎您持不同意见并形成自己的观点,但我希望本章中的想法至少能够成为富有成效的讨论的起点,并为一些经常被混淆的概念带来一些清晰度。

Opinions and speculation about the future are of course subjective, and so I will use the first person in this chapter when writing about my personal opinions. You are welcome to disagree with them and form your own opinions, but I hope that the ideas in this chapter will at least be a starting point for a productive discussion and bring some clarity to concepts that are often confused.

第 1 章 概述了本书的目标:探索如何创建可靠可扩展可维护的应用程序和系统。这些主题贯穿了所有章节:例如,我们讨论了许多有助于提高可靠性的容错算法、提高可扩展性的分区以及提高可维护性的演化和抽象机制。在本章中,我们将把所有这些想法结合在一起,并在此基础上展望未来。我们的目标是发现如何设计比当今的应用程序更好的应用程序——健壮、正确、可进化,并最终造福于人类。

The goal of this book was outlined in Chapter 1: to explore how to create applications and systems that are reliable, scalable, and maintainable. These themes have run through all of the chapters: for example, we discussed many fault-tolerance algorithms that help improve reliability, partitioning to improve scalability, and mechanisms for evolution and abstraction that improve maintainability. In this chapter we will bring all of these ideas together, and build on them to envisage the future. Our goal is to discover how to design applications that are better than the ones of today—robust, correct, evolvable, and ultimately beneficial to humanity.

数据集成

Data Integration

本书中反复出现的主题是,对于任何给定的问题,都有多种解决方案,所有这些解决方案都有不同的优点、缺点和权衡。例如,在 第3章讨论存储引擎时,我们看到了日志结构存储、B树和面向列的存储。在第 5 章讨论复制时,我们看到了单领导者、多领导者和无领导者方法。

A recurring theme in this book has been that for any given problem, there are several solutions, all of which have different pros, cons, and trade-offs. For example, when discussing storage engines in Chapter 3, we saw log-structured storage, B-trees, and column-oriented storage. When discussing replication in Chapter 5, we saw single-leader, multi-leader, and leaderless approaches.

如果您遇到诸如“我想存储一些数据并稍后再查找”之类的问题,则没有一种正确的解决方案,而是有许多不同的方法,每种方法都适合不同的情况。软件实现通常必须选择一种特定的方法。让一条代码路径健壮且性能良好已经够困难的了——试图在一个软件中完成所有事情几乎肯定会导致实施效果很差。

If you have a problem such as “I want to store some data and look it up again later,” there is no one right solution, but many different approaches that are each appropriate in different circumstances. A software implementation typically has to pick one particular approach. It’s hard enough to get one code path robust and performing well—trying to do everything in one piece of software almost guarantees that the implementation will be poor.

因此,最合适的软件工具的选择也取决于具体情况。每个软件,甚至所谓的“通用”数据库,都是针对特定的使用模式而设计的。

Thus, the most appropriate choice of software tool also depends on the circumstances. Every piece of software, even a so-called “general-purpose” database, is designed for a particular usage pattern.

面对如此丰富的替代方案,第一个挑战是找出软件产品与它们最适合的环境之间的映射。供应商不愿意告诉您他们的软件不适合哪些工作负载,这是可以理解的,但希望前面的章节已经为您提供了一些问题,以便您了解字里行间并更好地理解权衡。

Faced with this profusion of alternatives, the first challenge is then to figure out the mapping between the software products and the circumstances in which they are a good fit. Vendors are understandably reluctant to tell you about the kinds of workloads for which their software is poorly suited, but hopefully the previous chapters have equipped you with some questions to ask in order to read between the lines and better understand the trade-offs.

然而,即使您完全理解工具及其使用环境之间的映射,仍然存在另一个挑战:在复杂的应用程序中,数据通常以多种不同的方式使用。不可能有一款软件适合使用数据的所有不同环境,因此您不可避免地最终必须将多个不同的软件拼凑在一起才能提供应用程序的功能。

However, even if you perfectly understand the mapping between tools and circumstances for their use, there is another challenge: in complex applications, data is often used in several different ways. There is unlikely to be one piece of software that is suitable for all the different circumstances in which the data is used, so you inevitably end up having to cobble together several different pieces of software in order to provide your application’s functionality.

通过导出数据组合专用工具

Combining Specialized Tools by Deriving Data

例如,通常需要将 OLTP 数据库与全文搜索索引集成,以便处理任意关键字的查询。尽管某些数据库(例如 PostgreSQL)包含全文索引功能,这对于简单的应用程序来说已经足够了 [ 1 ],但更复杂的搜索设施需要专业的信息检索工具。相反,搜索索引通常不太适合作为持久记录系统,因此许多应用程序需要结合两种不同的工具才能满足所有要求。

For example, it is common to need to integrate an OLTP database with a full-text search index in order to handle queries for arbitrary keywords. Although some databases (such as PostgreSQL) include a full-text indexing feature, which can be sufficient for simple applications [1], more sophisticated search facilities require specialist information retrieval tools. Conversely, search indexes are generally not very suitable as a durable system of record, and so many applications need to combine two different tools in order to satisfy all of the requirements.

我们在“保持系统同步” 中谈到了集成数据系统的问题。随着数据不同表示形式数量的增加,集成问题变得更加困难。除了数据库和搜索索引之外,也许您还需要在分析系统(数据仓库或批处理和流处理系统)中保留数据的副本;维护从原始数据派生的对象的缓存或非规范化版本;通过机器学习、分类、排名或推荐系统传递数据;或根据数据更改发送通知。

We touched on the issue of integrating data systems in “Keeping Systems in Sync”. As the number of different representations of the data increases, the integration problem becomes harder. Besides the database and the search index, perhaps you need to keep copies of the data in analytics systems (data warehouses, or batch and stream processing systems); maintain caches or denormalized versions of objects that were derived from the original data; pass the data through machine learning, classification, ranking, or recommendation systems; or send notifications based on changes to the data.

令人惊讶的是,我经常看到软件工程师做出这样的陈述:“根据我的经验,99% 的人只需要 X”或“……不需要 X”(对于不同的 X 值)。我认为这样的陈述更多地反映了说话者自身的经验,而不是技术的实际用途。您可能想要对数据执行的不同操作的范围之广令人眼花缭乱。一个人认为晦涩且毫无意义的功能,很可能是其他人的一项核心要求。通常只有当您把视角拉远、纵观整个组织中的数据流时,数据集成的需求才会变得明显。

Surprisingly often I see software engineers make statements like, “In my experience, 99% of people only need X” or “…don’t need X” (for various values of X). I think that such statements say more about the experience of the speaker than about the actual usefulness of a technology. The range of different things you might want to do with data is dizzyingly wide. What one person considers to be an obscure and pointless feature may well be a central requirement for someone else. The need for data integration often only becomes apparent if you zoom out and consider the dataflows across an entire organization.

关于数据流的推理

Reasoning about dataflows

当需要在多个存储系统中维护相同数据的副本以满足不同的访问模式时,您需要非常清楚输入和输出:数据首先写入哪里,以及哪些表示来自哪些来源?如何以正确的格式将数据传输到所有正确的位置?

When copies of the same data need to be maintained in several storage systems in order to satisfy different access patterns, you need to be very clear about the inputs and outputs: where is data written first, and which representations are derived from which sources? How do you get data into all the right places, in the right formats?

例如,您可以安排数据首先写入作为记录系统的数据库,捕获对该数据库所做的更改(请参阅“变更数据捕获”),然后以相同的顺序将更改应用到搜索索引。如果变更数据捕获 (CDC) 是更新索引的唯一方法,您就可以确信索引完全派生自记录系统,因此与其一致(除非软件中存在错误)。写入数据库是向该系统提供新输入的唯一方法。

For example, you might arrange for data to first be written to a system of record database, capturing the changes made to that database (see “Change Data Capture”) and then applying the changes to the search index in the same order. If change data capture (CDC) is the only way of updating the index, you can be confident that the index is entirely derived from the system of record, and therefore consistent with it (barring bugs in the software). Writing to the database is the only way of supplying new input into this system.
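
作为一个示意性的草图(其中的类名和事件格式均为假设),下面的代码演示了单个消费者按日志顺序将变更事件应用到派生的搜索索引,使索引内容完全由记录系统决定:

As an illustrative sketch (the class name and event format here are hypothetical), the following code shows a single consumer applying change events to a derived search index strictly in log order, so that the index content is fully determined by the system of record:

```python
# Hypothetical sketch of applying CDC events, in log order, to a derived index.

class SearchIndex:
    """A toy derived index: maps document id -> text, kept in sync via CDC."""
    def __init__(self):
        self.docs = {}

    def apply(self, event):
        # Events are applied in the exact order they appear in the change log,
        # so the index is entirely derived from the system of record.
        if event["op"] == "upsert":
            self.docs[event["id"]] = event["text"]
        elif event["op"] == "delete":
            self.docs.pop(event["id"], None)

change_log = [
    {"op": "upsert", "id": 1, "text": "hello world"},
    {"op": "upsert", "id": 1, "text": "hello CDC"},
    {"op": "delete", "id": 1},
    {"op": "upsert", "id": 2, "text": "derived data"},
]

index = SearchIndex()
for event in change_log:   # a single consumer, same order as the log
    index.apply(event)
```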

允许应用程序直接写入搜索索引和数据库会带来如图11-4所示的问题,其中两个客户端同时发送冲突的写入,并且两个存储系统以不同的顺序处理它们。在这种情况下,数据库和搜索索引都不“负责”确定写入顺序,因此它们可能会做出矛盾的决定并变得永久不一致。

Allowing the application to directly write to both the search index and the database introduces the problem shown in Figure 11-4, in which two clients concurrently send conflicting writes, and the two storage systems process them in a different order. In this case, neither the database nor the search index is “in charge” of determining the order of writes, and so they may make contradictory decisions and become permanently inconsistent with each other.

如果您可以让所有用户输入都汇经一个决定所有写入顺序的单一系统,那么通过以相同顺序处理写入来派生数据的其他表示形式就会变得容易得多。这是我们在“全序广播”中看到的状态机复制方法的一个应用。无论您使用变更数据捕获还是事件溯源日志,都不如确定全序这一原则本身重要。

If it is possible for you to funnel all user input through a single system that decides on an ordering for all writes, it becomes much easier to derive other representations of the data by processing the writes in the same order. This is an application of the state machine replication approach that we saw in “Total Order Broadcast”. Whether you use change data capture or an event sourcing log is less important than simply the principle of deciding on a total order.

基于事件日志更新派生数据系统通常可以具有确定性和幂等性(请参阅“幂等性”),从而很容易从故障中恢复。

Updating a derived data system based on an event log can often be made deterministic and idempotent (see “Idempotence”), making it quite easy to recover from faults.
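
下面是一个最小的幂等消费者草图(其中的偏移量记录方式属于假设):它记录已应用的最高日志偏移量,因此崩溃后重放日志不会产生重复效果。

A minimal sketch of an idempotent consumer (the offset-tracking scheme here is an assumption): it records the highest log offset already applied, so replaying the log after a crash has no duplicate effect.

```python
# Hypothetical sketch: idempotent application of log events via offset tracking.

class IdempotentConsumer:
    def __init__(self):
        self.state = {}
        self.last_offset = -1  # highest log offset already applied

    def apply(self, offset, event):
        if offset <= self.last_offset:
            return  # duplicate delivery after a crash: safely ignored
        self.state[event["key"]] = event["value"]
        self.last_offset = offset

c = IdempotentConsumer()
log = [(0, {"key": "a", "value": 1}), (1, {"key": "b", "value": 2})]
for off, ev in log:
    c.apply(off, ev)
# After a crash, the log is replayed from the start; the result is unchanged:
for off, ev in log:
    c.apply(off, ev)
```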

派生数据与分布式事务

Derived data versus distributed transactions

保持不同数据系统彼此一致的经典方法涉及分布式事务,如“原子提交和两阶段提交(2PC)”中所述。与分布式事务相比,使用派生数据系统的方法效果如何?

The classic approach for keeping different data systems consistent with each other involves distributed transactions, as discussed in “Atomic Commit and Two-Phase Commit (2PC)”. How does the approach of using derived data systems fare in comparison to distributed transactions?

在抽象层面上,他们通过不同的方式实现相似的目标。分布式事务通过使用互斥锁来决定写入顺序(请参阅“两相锁定(2PL)”),而 CDC 和事件源则使用日志进行排序。分布式事务使用原子提交来确保更改一次生效,而基于日志的系统通常基于确定性重试和幂等性。

At an abstract level, they achieve a similar goal by different means. Distributed transactions decide on an ordering of writes by using locks for mutual exclusion (see “Two-Phase Locking (2PL)”), while CDC and event sourcing use a log for ordering. Distributed transactions use atomic commit to ensure that changes take effect exactly once, while log-based systems are often based on deterministic retry and idempotence.

最大的区别在于事务系统通常提供线性化(请参阅“线性化”),这蕴含着一些有用的保证,例如读取您自己的写入(请参阅“读取您自己的写入”)。另一方面,派生数据系统通常是异步更新的,因此默认情况下它们不提供同样的时序保证。

The biggest difference is that transaction systems usually provide linearizability (see “Linearizability”), which implies useful guarantees such as reading your own writes (see “Reading Your Own Writes”). On the other hand, derived data systems are often updated asynchronously, and so they do not by default offer the same timing guarantees.

在愿意支付分布式事务成本的有限环境中,它们已被成功使用。然而,我认为 XA 的容错能力和性能特征较差(请参阅“实践中的分布式事务”),这严重限制了它的实用性。我相信为分布式事务创建一个更好的协议是可能的,但让这样的协议被广泛采用并与现有工具集成将是一个挑战,而且不太可能很快发生。

Within limited environments that are willing to pay the cost of distributed transactions, they have been used successfully. However, I think that XA has poor fault tolerance and performance characteristics (see “Distributed Transactions in Practice”), which severely limit its usefulness. I believe that it might be possible to create a better protocol for distributed transactions, but getting such a protocol widely adopted and integrated with existing tools would be challenging, and unlikely to happen soon.

在缺乏对良好分布式事务协议的广泛支持的情况下,我相信基于日志的派生数据是集成不同数据系统的最有前途的方法。然而,诸如读取自己的写入之类的保证是有用的,我认为告诉每个人“最终一致性是不可避免的,忍受它并学会处理它”并没有成效(至少在没有关于如何处理它的良好指导的情况下是如此)。

In the absence of widespread support for a good distributed transaction protocol, I believe that log-based derived data is the most promising approach for integrating different data systems. However, guarantees such as reading your own writes are useful, and I don’t think that it is productive to tell everyone “eventual consistency is inevitable—suck it up and learn to deal with it” (at least not without good guidance on how to deal with it).

“瞄准正确性”中,我们将讨论一些在异步派生系统之上实现更强保证的方法,并努力实现分布式事务和基于异步日志的系统之间的中间立场。

In “Aiming for Correctness” we will discuss some approaches for implementing stronger guarantees on top of asynchronously derived systems, and work toward a middle ground between distributed transactions and asynchronous log-based systems.

全序的限制

The limits of total ordering

对于足够小的系统,构建完全有序的事件日志是完全可行的(正如具有单主复制的数据库的流行所证明的那样,它精确地构建了这样的日志)。然而,随着系统扩展到更大、更复杂的工作负载,限制开始出现:

With systems that are small enough, constructing a totally ordered event log is entirely feasible (as demonstrated by the popularity of databases with single-leader replication, which construct precisely such a log). However, as systems are scaled toward bigger and more complex workloads, limitations begin to emerge:

  • 在大多数情况下,构建完全排序的日志需要所有事件都通过决定排序的单个领导节点。如果事件的吞吐量大于单台机器的处理能力,则需要将其分区到多台机器上(请参阅“分区日志”)。两个不同分区中的事件顺序是不明确的。

  • In most cases, constructing a totally ordered log requires all events to pass through a single leader node that decides on the ordering. If the throughput of events is greater than a single machine can handle, you need to partition it across multiple machines (see “Partitioned Logs”). The order of events in two different partitions is then ambiguous.

  • 如果服务器分布在多个地理上分布的数据中心,例如为了容忍整个数据中心离线,通常每个数据中心都有一个单独的领导者,因为网络延迟使得同步跨数据中心协调效率低下(请参阅“多领导者复制” ) ”)。这意味着源自两个不同数据中心的事件的顺序未定义。

  • If the servers are spread across multiple geographically distributed datacenters, for example in order to tolerate an entire datacenter going offline, you typically have a separate leader in each datacenter, because network delays make synchronous cross-datacenter coordination inefficient (see “Multi-Leader Replication”). This implies an undefined ordering of events that originate in two different datacenters.

  • 当应用程序部署为微服务时(请参阅“通过服务的数据流:REST 和 RPC”),常见的设计选择是将每个服务及其持久状态部署为独立单元,服务之间不共享持久状态。当两个事件源自不同的服务时,这些事件没有定义的顺序。

  • When applications are deployed as microservices (see “Dataflow Through Services: REST and RPC”), a common design choice is to deploy each service and its durable state as an independent unit, with no durable state shared between services. When two events originate in different services, there is no defined order for those events.

  • 一些应用程序维护客户端状态,该状态在用户输入时立即更新(无需等待服务器的确认),甚至继续离线工作(请参阅 “离线操作的客户端”)。使用此类应用程序,客户端和服务器很可能会以不同的顺序看到事件。

  • Some applications maintain client-side state that is updated immediately on user input (without waiting for confirmation from a server), and even continue to work offline (see “Clients with offline operation”). With such applications, clients and servers are very likely to see events in different orders.

用正式术语来说,决定事件的全序称为全序广播,它等价于共识(参见“共识算法和全序广播”)。大多数共识算法是针对单个节点的吞吐量足以处理整个事件流的情况而设计的,这些算法没有提供让多个节点分担事件排序工作的机制。设计能够超越单个节点吞吐量、并在地理分布式环境中良好运行的共识算法,仍然是一个开放的研究问题。

In formal terms, deciding on a total order of events is known as total order broadcast, which is equivalent to consensus (see “Consensus algorithms and total order broadcast”). Most consensus algorithms are designed for situations in which the throughput of a single node is sufficient to process the entire stream of events, and these algorithms do not provide a mechanism for multiple nodes to share the work of ordering the events. It is still an open research problem to design consensus algorithms that can scale beyond the throughput of a single node and that work well in a geographically distributed setting.

对事件进行排序以捕获因果关系

Ordering events to capture causality

在事件之间没有因果关系的情况下,缺乏全序并不是一个大问题,因为并发事件可以任意排序。其他一些情况很容易处理:例如,当同一对象有多个更新时,可以通过将特定对象 ID 的所有更新路由到同一日志分区来完全排序它们。然而,因果依赖性有时会以更微妙的方式出现(另见“顺序和因果关系”)。

In cases where there is no causal link between events, the lack of a total order is not a big problem, since concurrent events can be ordered arbitrarily. Some other cases are easy to handle: for example, when there are multiple updates of the same object, they can be totally ordered by routing all updates for a particular object ID to the same log partition. However, causal dependencies sometimes arise in more subtle ways (see also “Ordering and Causality”).
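
将同一对象的所有更新路由到同一日志分区,可以用一个简单的哈希分区函数来示意(分区数等细节均为假设):

Routing all updates for one object to the same log partition can be sketched with a simple hash-partitioning function (the partition count and other details are assumptions):

```python
import hashlib

NUM_PARTITIONS = 4  # assumed partition count for illustration

def partition_for(object_id: str) -> int:
    # Hash-partitioning: all updates for a given object ID land in the same
    # partition, so they are totally ordered relative to each other.
    digest = hashlib.sha256(object_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Every update to "user:42" is routed to the same partition:
assert partition_for("user:42") == partition_for("user:42")
```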

例如,考虑一个社交网络服务,以及两个曾处于恋爱关系但刚刚分手的用户。其中一人将对方从好友中删除,然后向剩下的好友发送一条抱怨前伴侣的消息。该用户的意图是前伴侣不应看到这条无礼的消息,因为消息是在好友状态被撤销之后发送的。

For example, consider a social networking service, and two users who were in a relationship but have just broken up. One of the users removes the other as a friend, and then sends a message to their remaining friends complaining about their ex-partner. The user’s intention is that their ex-partner should not see the rude message, since the message was sent after the friend status was revoked.

然而,在将好友状态存储在一个位置而将消息存储在另一位置的系统中,不好友事件和消息发送事件之间的排序依赖关系可能会丢失。如果未捕获因果依赖性,则发送有关新消息的通知的服务可能会在取消好友事件之前处理消息发送事件,从而错误地将通知发送给前合作伙伴。

However, in a system that stores friendship status in one place and messages in another place, that ordering dependency between the unfriend event and the message-send event may be lost. If the causal dependency is not captured, a service that sends notifications about new messages may process the message-send event before the unfriend event, and thus incorrectly send a notification to the ex-partner.

在此示例中,通知实际上是消息和好友列表之间的联接,这使其与我们之前讨论的联接的时间问题相关(请参阅“联接的时间依赖性”)。不幸的是,这个问题似乎没有一个简单的答案 [ 2 , 3 ]。出发点包括:

In this example, the notifications are effectively a join between the messages and the friend list, making it related to the timing issues of joins that we discussed previously (see “Time-dependence of joins”). Unfortunately, there does not seem to be a simple answer to this problem [2, 3]. Starting points include:

  • 逻辑时间戳可以提供无需协调的全排序(请参阅 “序列号排序”),因此在全序广播不可行的情况下它们可能会有所帮助。但是,它们仍然要求接收者处理无序传递的事件,并且需要传递额外的元数据。

  • Logical timestamps can provide total ordering without coordination (see “Sequence Number Ordering”), so they may help in cases where total order broadcast is not feasible. However, they still require recipients to handle events that are delivered out of order, and they require additional metadata to be passed around.

  • 如果您可以记录一个事件来记下用户在做出决定之前所看到的系统状态,并为该事件赋予唯一标识符,那么任何后续事件都可以引用该事件标识符以记录因果依赖性 [ 4 ]。我们将在“读取也是事件”中回到这个想法。

  • If you can log an event to record the state of the system that the user saw before making a decision, and give that event a unique identifier, then any later events can reference that event identifier in order to record the causal dependency [4]. We will return to this idea in “Reads are events too”.

  • 冲突解决算法(请参阅“自动冲突解决”)有助于处理以意外顺序传递的事件。它们对于维护状态很有用,但如果操作具有外部副作用(例如向用户发送通知),它们就没有帮助。

  • Conflict resolution algorithms (see “Automatic Conflict Resolution”) help with processing events that are delivered in an unexpected order. They are useful for maintaining state, but they do not help if actions have external side effects (such as sending a notification to a user).
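
上面第一点提到的逻辑时间戳可以用一个最小的 Lamport 时钟草图来说明(接口为假设):因果在后的事件总是获得更大的时间戳。

The logical timestamps mentioned in the first point above can be illustrated with a minimal Lamport clock sketch (the interface is an assumption): a causally later event always receives a larger timestamp.

```python
# Hypothetical minimal Lamport clock sketch (interface names are illustrative).

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time  # timestamp attached to the outgoing message

    def receive(self, msg_time):
        # Advance past both our own clock and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.send()          # a's clock: 1
t2 = b.receive(t1)     # b's clock: max(0, 1) + 1 = 2
t3 = b.local_event()   # b's clock: 3
```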

也许,随着时间的推移,将会出现一些应用程序开发模式,使因果依赖关系能够被高效捕获、派生状态能够被正确维护,而不必强迫所有事件都经过全序广播这一瓶颈。

Perhaps, over time, patterns for application development will emerge that allow causal dependencies to be captured efficiently, and derived state to be maintained correctly, without forcing all events to go through the bottleneck of total order broadcast.

批处理和流处理

Batch and Stream Processing

我想说,数据集成的目标是确保数据最终以正确的形式出现在所有正确的位置。这样做需要消耗输入、转换、连接、过滤、聚合、训练模型、评估并最终写入适当的输出。批处理和流处理器是实现这一目标的工具。

I would say that the goal of data integration is to make sure that data ends up in the right form in all the right places. Doing so requires consuming inputs, transforming, joining, filtering, aggregating, training models, evaluating, and eventually writing to the appropriate outputs. Batch and stream processors are the tools for achieving this goal.

批处理和流处理的输出是派生数据集,例如搜索索引、物化视图、向用户显示的建议、聚合指标等(请参阅“批处理工作流的输出”“流处理的使用”)。

The outputs of batch and stream processes are derived datasets such as search indexes, materialized views, recommendations to show to users, aggregate metrics, and so on (see “The Output of Batch Workflows” and “Uses of Stream Processing”).

正如我们在第 10 章和第 11 章中看到的,批处理和流处理有很多共同的原理,主要的根本区别在于流处理器在无界数据集上运行,而批处理输入具有已知的有限大小。处理引擎的实现方式也存在许多细节差异,但这些差异开始变得模糊。

As we saw in Chapter 10 and Chapter 11, batch and stream processing have a lot of principles in common, and the main fundamental difference is that stream processors operate on unbounded datasets whereas batch process inputs are of a known, finite size. There are also many detailed differences in the ways the processing engines are implemented, but these distinctions are beginning to blur.

Spark 通过将流分解为微批次,在批处理引擎之上执行流处理,而 Apache Flink 则在流处理引擎之上执行批处理 [ 5 ]。原则上,一种处理类型可以在另一种之上模拟,尽管性能特征有所不同:例如,微批处理可能在跳跃窗口或滑动窗口上表现不佳 [ 6 ]。

Spark performs stream processing on top of a batch processing engine by breaking the stream into microbatches, whereas Apache Flink performs batch processing on top of a stream processing engine [5]. In principle, one type of processing can be emulated on top of the other, although the performance characteristics vary: for example, microbatching may perform poorly on hopping or sliding windows [6].

维护派生状态

Maintaining derived state

批处理具有相当强的函数式风格(即使代码不是用函数式编程语言编写的):它鼓励确定性的纯函数,其输出仅取决于输入,除显式输出之外没有副作用,并将输入视为不可变、输出视为仅追加。流处理与之类似,但它扩展了算子以支持托管的、容错的状态(请参阅“故障后重建状态”)。

Batch processing has a quite strong functional flavor (even if the code is not written in a functional programming language): it encourages deterministic, pure functions whose output depends only on the input and which have no side effects other than the explicit outputs, treating inputs as immutable and outputs as append-only. Stream processing is similar, but it extends operators to allow managed, fault-tolerant state (see “Rebuilding state after a failure”).

具有明确定义的输入和输出的确定性函数的原理不仅有利于容错(参见“幂等性”),而且还简化了组织中数据流的推理[ 7 ]。无论派生数据是搜索索引、统计模型还是缓存,从数据管道的角度思考是有帮助的,数据管道从一个事物派生出另一个事物,通过功能应用程序代码推动一个系统中的状态变化并应用效果到派生系统。

The principle of deterministic functions with well-defined inputs and outputs is not only good for fault tolerance (see “Idempotence”), but also simplifies reasoning about the dataflows in an organization [7]. No matter whether the derived data is a search index, a statistical model, or a cache, it is helpful to think in terms of data pipelines that derive one thing from another, pushing state changes in one system through functional application code and applying the effects to derived systems.
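
用确定性纯函数派生视图的原则可以用一个简单草图说明(事件格式为假设):给定相同的不可变输入,重新运行总是产生相同的派生视图。

The principle of deriving views with deterministic, pure functions can be shown in a small sketch (the event format is assumed): given the same immutable input, rerunning always produces the same derived view.

```python
# Hypothetical sketch: a deterministic, side-effect-free derivation function.

def derive_view(events):
    # Derive a per-user latest-name view from an immutable sequence of events.
    # The output depends only on the input; rerunning it is always safe.
    view = {}
    for e in events:
        view[e["user"]] = e["name"]
    return view

events = ({"user": 1, "name": "Ada"}, {"user": 1, "name": "Ada L."})
assert derive_view(events) == derive_view(events)  # deterministic
```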

原则上,派生数据系统可以同步维护,就像关系数据库在写入被索引的表时在同一事务中同步更新二级索引一样。然而,异步性使得基于事件日志的系统变得健壮:它允许系统某一部分的故障被包含在本地,而分布式事务如果任何一个参与者发生故障就会中止,因此它们往往会通过将故障传播到其他部分来放大故障系统的(参见“分布式事务的限制”)。

In principle, derived data systems could be maintained synchronously, just like a relational database updates secondary indexes synchronously within the same transaction as writes to the table being indexed. However, asynchrony is what makes systems based on event logs robust: it allows a fault in one part of the system to be contained locally, whereas distributed transactions abort if any one participant fails, so they tend to amplify failures by spreading them to the rest of the system (see “Limitations of distributed transactions”).

我们在“分区和二级索引”中看到,二级索引经常跨越分区边界。具有二级索引的分区系统需要将写入发送到多个分区(如果索引是术语分区的)或将读取发送到所有分区(如果索引是文档分区的)。如果索引是异步维护的,这种跨分区通信也是最可靠和可扩展的[ 8 ](另见“多分区数据处理”)。

We saw in “Partitioning and Secondary Indexes” that secondary indexes often cross partition boundaries. A partitioned system with secondary indexes either needs to send writes to multiple partitions (if the index is term-partitioned) or send reads to all partitions (if the index is document-partitioned). Such cross-partition communication is also most reliable and scalable if the index is maintained asynchronously [8] (see also “Multi-partition data processing”).

重新处理数据以促进应用程序的发展

Reprocessing data for application evolution

在维护派生数据时,批处理和流处理都很有用。流处理允许输入的更改以低延迟反映在派生视图中,而批处理允许重新处理大量累积的历史数据,以便在现有数据集上派生新视图。

When maintaining derived data, batch and stream processing are both useful. Stream processing allows changes in the input to be reflected in derived views with low delay, whereas batch processing allows large amounts of accumulated historical data to be reprocessed in order to derive new views onto an existing dataset.

特别是,重新处理现有数据为维护系统、改进系统以支持新功能和更改的需求提供了良好的机制(请参阅第 4 章)。如果不进行重新处理,模式演变仅限于简单的更改,例如向记录添加新的可选字段或添加新类型的记录。在写入时模式和读取时模式上下文中都是这种情况(请参阅“文档模型中的模式灵活性”)。另一方面,通过重新处理,可以将数据集重组为完全不同的模型,以便更好地满足新的需求。

In particular, reprocessing existing data provides a good mechanism for maintaining a system, evolving it to support new features and changed requirements (see Chapter 4). Without reprocessing, schema evolution is limited to simple changes like adding a new optional field to a record, or adding a new type of record. This is the case both in a schema-on-write and in a schema-on-read context (see “Schema flexibility in the document model”). On the other hand, with reprocessing it is possible to restructure a dataset into a completely different model in order to better serve new requirements.

派生视图允许渐进式演化。如果您想要重构数据集,不需要以突然切换的方式执行迁移。相反,您可以将旧模式和新模式并排维护,作为同一底层数据的两个独立派生视图。然后,您可以开始将少量用户转移到新视图,以测试其性能并发现任何错误,而大多数用户继续被路由到旧视图。逐渐地,您可以增加访问新视图的用户比例,最终删除旧视图 [ 10 ]。

Derived views allow gradual evolution. If you want to restructure a dataset, you do not need to perform the migration as a sudden switch. Instead, you can maintain the old schema and the new schema side by side as two independently derived views onto the same underlying data. You can then start shifting a small number of users to the new view in order to test its performance and find any bugs, while most users continue to be routed to the old view. Gradually, you can increase the proportion of users accessing the new view, and eventually you can drop the old view [10].

这种渐进式迁移的美妙之处在于,如果出现问题,该过程的每个阶段都可以轻松逆转:您始终有一个可以返回的工作系统。通过降低不可逆转损坏的风险,您可以更有信心继续前进,从而更快地改进您的系统 [ 11 ]。

The beauty of such a gradual migration is that every stage of the process is easily reversible if something goes wrong: you always have a working system to go back to. By reducing the risk of irreversible damage, you can be more confident about going ahead, and thus move faster to improve your system [11].

拉姆达架构

The lambda architecture

如果使用批处理来重新处理历史数据,使用流处理来处理最近的更新,那么如何将两者结合起来?lambda 架构[ 12 ]是该领域的一项提案,受到了广泛关注。

If batch processing is used to reprocess historical data, and stream processing is used to process recent updates, then how do you combine the two? The lambda architecture [12] is a proposal in this area that has gained a lot of attention.

lambda 架构的核心思想是,应通过将不可变事件追加到持续增长的数据集来记录传入数据,类似于事件溯源(请参阅“事件溯源”)。从这些事件中可以派生出读取优化的视图。lambda 架构建议并行运行两个不同的系统:一个批处理系统(例如 Hadoop MapReduce)和一个单独的流处理系统(例如 Storm)。

The core idea of the lambda architecture is that incoming data should be recorded by appending immutable events to an always-growing dataset, similarly to event sourcing (see “Event Sourcing”). From these events, read-optimized views are derived. The lambda architecture proposes running two different systems in parallel: a batch processing system such as Hadoop MapReduce, and a separate stream-processing system such as Storm.

在 lambda 方法中,流处理器消费事件并快速生成视图的近似更新;批处理器稍后消费同一组事件并生成派生视图的修正版本。这种设计背后的理由是:批处理更简单,因此更不容易出错;而流处理器被认为可靠性较低且更难实现容错(请参阅“容错”)。此外,流处理可以使用快速的近似算法,而批处理则使用较慢的精确算法。

In the lambda approach, the stream processor consumes the events and quickly produces an approximate update to the view; the batch processor later consumes the same set of events and produces a corrected version of the derived view. The reasoning behind this design is that batch processing is simpler and thus less prone to bugs, while stream processors are thought to be less reliable and harder to make fault-tolerant (see “Fault Tolerance”). Moreover, the stream process can use fast approximate algorithms while the batch process uses slower exact algorithms.
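
lambda 架构中批处理层与速度层的合并可以示意如下(这里用精确计数代替真实系统中可能使用的近似算法,属于简化假设):

Merging the batch and speed layers of a lambda architecture can be sketched as follows (using exact counts in place of the approximate algorithms a real system might use is a simplifying assumption):

```python
# Hypothetical lambda-architecture sketch: batch layer + speed layer + merge.
from collections import Counter

events = ["a", "b", "a", "c", "a", "b"]

def batch_counts(all_events):
    # Batch layer: periodically recomputes exact counts from the full history.
    return Counter(all_events)

def streaming_counts(recent_events):
    # Speed layer: fast incremental counts over events since the last batch run.
    counts = Counter()
    for e in recent_events:
        counts[e] += 1
    return counts

batch_view = batch_counts(events[:4])      # last batch run saw the first 4 events
speed_view = streaming_counts(events[4:])  # stream covers events since then
merged = batch_view + speed_view           # serving layer merges both views
```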

lambda 架构是一个有影响力的想法,它使数据系统的设计变得更好,特别是通过普及派生视图到不可变事件流并在需要时重新处理事件的原则。但我也认为它存在一些实际问题:

The lambda architecture was an influential idea that shaped the design of data systems for the better, particularly by popularizing the principle of deriving views onto streams of immutable events and reprocessing events when needed. However, I also think that it has a number of practical problems:

  • 必须维护同一套逻辑,使其既能在批处理框架中运行又能在流处理框架中运行,这是一项巨大的额外工作。尽管像 Summingbird [ 13 ] 这样的库为可以在批处理或流式上下文中运行的计算提供了抽象,但调试、调整和维护两个不同系统的操作复杂性仍然存在 [ 14 ]。

  • Having to maintain the same logic to run both in a batch and in a stream processing framework is significant additional effort. Although libraries such as Summingbird [13] provide an abstraction for computations that can be run in either a batch or a streaming context, the operational complexity of debugging, tuning, and maintaining two different systems remains [14].

  • 由于流管道和批处理管道产生单独的输出,因此需要将它们合并以响应用户请求。如果计算是翻滚窗口上的简单聚合,则这种合并相当容易,但如果使用更复杂的操作(例如连接和会话化)派生视图,或者输出不是时间序列,则合并会变得非常困难。

  • Since the stream pipeline and the batch pipeline produce separate outputs, they need to be merged in order to respond to user requests. This merge is fairly easy if the computation is a simple aggregation over a tumbling window, but it becomes significantly harder if the view is derived using more complex operations such as joins and sessionization, or if the output is not a time series.

  • 尽管能够重新处理整个历史数据集固然很棒,但对于大型数据集来说,频繁这样做成本高昂。因此,批处理管道通常需要设置为处理增量批处理(例如,每小时结束时的一小时数据),而不是重新处理所有内容。这引发了“关于时间的推理”中讨论的问题,例如处理掉队者和处理跨批次边界的窗口。增量批处理计算会增加复杂性,使其更类似于流层,这与保持批处理层尽可能简单的目标背道而驰。

  • Although it is great to have the ability to reprocess the entire historical dataset, doing so frequently is expensive on large datasets. Thus, the batch pipeline often needs to be set up to process incremental batches (e.g., an hour’s worth of data at the end of every hour) rather than reprocessing everything. This raises the problems discussed in “Reasoning About Time”, such as handling stragglers and handling windows that cross boundaries between batches. Incrementalizing a batch computation adds complexity, making it more akin to the streaming layer, which runs counter to the goal of keeping the batch layer as simple as possible.

统一批处理和流处理

Unifying batch and stream processing

最近的工作通过允许在同一个系统中同时实现批处理计算(重新处理历史数据)和流计算(在事件到达时处理),使人们能够享受 lambda 架构的好处而没有其缺点 [ 15 ]。

More recent work has enabled the benefits of the lambda architecture to be enjoyed without its downsides, by allowing both batch computations (reprocessing historical data) and stream computations (processing events as they arrive) to be implemented in the same system [15].

在一个系统中统一批处理和流处理需要以下功能,这些功能正变得越来越广泛:

Unifying batch and stream processing in one system requires the following features, which are becoming increasingly widely available:

  • 能够通过处理最近事件流的同一处理引擎重播历史事件。例如,基于日志的消息代理能够重播消息(请参阅 “重播旧消息”),并且某些流处理器可以从 HDFS 等分布式文件系统读取输入。

  • The ability to replay historical events through the same processing engine that handles the stream of recent events. For example, log-based message brokers have the ability to replay messages (see “Replaying old messages”), and some stream processors can read input from a distributed filesystem like HDFS.

  • 流处理器的恰好一次语义,即确保输出与未发生故障时相同,即使实际上确实发生了故障(请参阅“容错”)。与批处理一样,这需要丢弃任何失败任务的部分输出。

  • Exactly-once semantics for stream processors—that is, ensuring that the output is the same as if no faults had occurred, even if faults did in fact occur (see “Fault Tolerance”). Like with batch processing, this requires discarding the partial output of any failed tasks.

  • 按事件时间而非处理时间进行窗口化的工具,因为在重新处理历史事件时处理时间毫无意义(请参阅“关于时间的推理”)。例如,Apache Beam 提供了一个用于表达此类计算的 API,然后可以使用 Apache Flink 或 Google Cloud Dataflow 运行该计算。

  • Tools for windowing by event time, not by processing time, since processing time is meaningless when reprocessing historical events (see “Reasoning About Time”). For example, Apache Beam provides an API for expressing such computations, which can then be run using Apache Flink or Google Cloud Dataflow.
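
按事件时间(而非处理时间)进行的翻滚窗口聚合可以这样示意(窗口大小与事件格式均为假设):

Tumbling-window aggregation by event time (rather than processing time) can be sketched like this (the window size and event format are assumptions):

```python
# Hypothetical sketch: tumbling-window counts keyed by event time.
from collections import defaultdict

WINDOW = 60  # one-minute tumbling windows, in seconds (assumed)

def window_start(event_time: int) -> int:
    # The window an event belongs to depends only on its own timestamp,
    # not on when we happen to process it, so replays give the same result.
    return event_time - (event_time % WINDOW)

# Events carry their own event-time timestamps (seconds):
events = [(5, "click"), (59, "click"), (61, "click"), (130, "click")]

counts = defaultdict(int)
for ts, _ in events:
    counts[window_start(ts)] += 1
```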

分拆数据库

Unbundling Databases

在最抽象的层面上,数据库、Hadoop 和操作系统都执行相同的功能:它们存储一些数据,并允许您处理和查询该数据 [ 16 ]。数据库将数据存储在某些数据模型的记录中(表中的行、文档、图中的顶点等),而操作系统的文件系统将数据存储在文件中——但从本质上讲,两者都是“信息管理”系统[ 17 ] 。正如我们在第 10 章中看到的,Hadoop 生态系统有点像 Unix 的分布式版本。

At a most abstract level, databases, Hadoop, and operating systems all perform the same functions: they store some data, and they allow you to process and query that data [16]. A database stores data in records of some data model (rows in tables, documents, vertices in a graph, etc.) while an operating system’s filesystem stores data in files—but at their core, both are “information management” systems [17]. As we saw in Chapter 10, the Hadoop ecosystem is somewhat like a distributed version of Unix.

当然,存在许多实际差异。例如,许多文件系统不能很好地应对包含1000万个小文件的目录,而包含1000万个小记录的数据库则完全正常且不起眼。尽管如此,操作系统和数据库之间的异同还是值得探讨的。

Of course, there are many practical differences. For example, many filesystems do not cope very well with a directory containing 10 million small files, whereas a database containing 10 million small records is completely normal and unremarkable. Nevertheless, the similarities and differences between operating systems and databases are worth exploring.

Unix 和关系数据库以截然不同的理念解决信息管理问题。Unix 认为其目的是为程序员提供逻辑但相当低级的硬件抽象，而关系数据库则希望为应用程序程序员提供高级抽象，隐藏磁盘上数据结构、并发性、崩溃恢复等的复杂性。Unix 开发了只是字节序列的管道和文件，而数据库开发了 SQL 和事务。

Unix and relational databases have approached the information management problem with very different philosophies. Unix viewed its purpose as presenting programmers with a logical but fairly low-level hardware abstraction, whereas relational databases wanted to give application programmers a high-level abstraction that would hide the complexities of data structures on disk, concurrency, crash recovery, and so on. Unix developed pipes and files that are just sequences of bytes, whereas databases developed SQL and transactions.

哪种方法更好?当然,这取决于你想要什么。Unix 是“更简单”的,因为它是硬件资源的一个相当薄的包装。关系数据库“更简单”,因为简短的声明性查询可以利用许多强大的基础设施(查询优化、索引、连接方法、并发控制、复制等),而查询的作者不需要了解实现细节。

Which approach is better? Of course, it depends what you want. Unix is “simpler” in the sense that it is a fairly thin wrapper around hardware resources; relational databases are “simpler” in the sense that a short declarative query can draw on a lot of powerful infrastructure (query optimization, indexes, join methods, concurrency control, replication, etc.) without the author of the query needing to understand the implementation details.

这些哲学之间的紧张关系已经持续了几十年(Unix 和关系模型都出现在 20 世纪 70 年代初),但仍然没有得到解决。例如,我将 NoSQL 运动解释为想要将 Unix 式的低级抽象方法应用于分布式 OLTP 数据存储领域。

The tension between these philosophies has lasted for decades (both Unix and the relational model emerged in the early 1970s) and still isn’t resolved. For example, I would interpret the NoSQL movement as wanting to apply a Unix-esque approach of low-level abstractions to the domain of distributed OLTP data storage.

在本节中,我将尝试调和这两种哲学,希望我们能够结合两个世界的优点。

In this section I will attempt to reconcile the two philosophies, in the hope that we can combine the best of both worlds.

组合数据存储技术

Composing Data Storage Technologies

在本书中,我们讨论了数据库提供的各种功能及其工作原理,包括:

Over the course of this book we have discussed various features provided by databases and how they work, including:

在第 10 章和第 11 章中，出现了类似的主题。我们讨论了构建全文搜索索引（请参阅“批处理工作流的输出”）、物化视图维护（请参阅“维护物化视图”），以及将更改从数据库复制到派生数据系统（请参阅“变更数据捕获”）。

In Chapters 10 and 11, similar themes emerged. We talked about building full-text search indexes (see “The Output of Batch Workflows”), about materialized view maintenance (see “Maintaining materialized views”), and about replicating changes from a database to derived data systems (see “Change Data Capture”).

数据库中内置的功能与人们使用批处理和流处理器构建的派生数据系统之间似乎存在相似之处。

It seems that there are parallels between the features that are built into databases and the derived data systems that people are building with batch and stream processors.

创建索引

Creating an index

想想当您在关系数据库中运行 CREATE INDEX 创建新索引时会发生什么。数据库必须扫描表的一致快照，挑选出所有被索引字段的值，对它们进行排序，然后写出索引。接着，它必须处理自获取一致快照以来积压的写入（假设创建索引时表没有被锁定，因此写入可以继续）。完成之后，只要有事务写入该表，数据库就必须继续保持索引为最新。

Think about what happens when you run CREATE INDEX to create a new index in a relational database. The database has to scan over a consistent snapshot of a table, pick out all of the field values being indexed, sort them, and write out the index. Then it must process the backlog of writes that have been made since the consistent snapshot was taken (assuming the table was not locked while creating the index, so writes could continue). Once that is done, the database must continue to keep the index up to date whenever a transaction writes to the table.
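The three phases just described can be illustrated with a toy sketch (not any real database's implementation): build the index from a consistent snapshot, then apply the backlog of writes, and keep applying writes from then on using the same function:

```python
# Toy table: row_id -> row dict; toy index: email value -> set of row_ids.

def build_index(snapshot):
    """Phase 1: derive a sorted-by-key index from a consistent snapshot."""
    index = {}
    for row_id, row in snapshot.items():
        index.setdefault(row["email"], set()).add(row_id)
    return index

def apply_write(index, row_id, old_row, new_row):
    """Phases 2 and 3: apply one write (from the backlog, or from a
    live transaction later on), keeping the index consistent."""
    if old_row is not None:
        index.setdefault(old_row["email"], set()).discard(row_id)
    if new_row is not None:
        index.setdefault(new_row["email"], set()).add(row_id)

snapshot = {1: {"email": "a@x.com"}, 2: {"email": "b@x.com"}}
index = build_index(snapshot)

# A write that happened while the index was being built (the backlog):
apply_write(index, 1, {"email": "a@x.com"}, {"email": "c@x.com"})
assert index["c@x.com"] == {1}
assert index["a@x.com"] == set()
```

The point of the sketch is that the same incremental update function serves both for catching up on the backlog and for ongoing maintenance, which is exactly the structural similarity to change data capture discussed next.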

此过程与设置新的从库副本非常相似（请参阅“设置新的从库”），也与在流系统中引导变更数据捕获非常相似（请参阅“初始快照”）。

This process is remarkably similar to setting up a new follower replica (see “Setting Up New Followers”), and also very similar to bootstrapping change data capture in a streaming system (see “Initial snapshot”).

每当您运行 CREATE INDEX 时，数据库本质上都会重新处理现有数据集（如“重新处理数据以促进应用程序演化”中所述），并将索引派生为现有数据之上的新视图。现有数据可能是状态的快照，而不是所有历史变更的日志，但两者密切相关（请参阅“状态、流和不变性”）。

Whenever you run CREATE INDEX, the database essentially reprocesses the existing dataset (as discussed in “Reprocessing data for application evolution”) and derives the index as a new view onto the existing data. The existing data may be a snapshot of the state rather than a log of all changes that ever happened, but the two are closely related (see “State, Streams, and Immutability”).
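The close relationship between a state snapshot and a replayable log can be illustrated with a toy log-based broker (a simplified sketch, not a real broker API): because the log is append-only and each consumer merely tracks its own offset, deriving a new view is just a matter of reading the log again from the beginning:

```python
class LogBroker:
    """A minimal stand-in for a log-based message broker."""

    def __init__(self):
        self.log = []  # append-only sequence of events

    def append(self, event):
        self.log.append(event)

    def read(self, offset):
        """Return all events from the given offset onward."""
        return self.log[offset:]

broker = LogBroker()
for e in ["create:1", "update:1", "delete:2"]:
    broker.append(e)

# A consumer that has processed everything is at offset 3 ...
assert broker.read(3) == []
# ... but can replay all of history at any time by resetting to offset 0,
# e.g. in order to build a brand-new derived view of the data.
assert broker.read(0) == ["create:1", "update:1", "delete:2"]
```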

一切的元数据库

The meta-database of everything

从这个角度来看,我认为整个组织的数据流开始看起来像一个巨大的数据库 [ 7 ]。每当批处理、流或 ETL 流程将数据从一个位置和表单传输到另一个位置和表单时,它就像数据库子系统一样,使索引或物化视图保持最新。

In this light, I think that the dataflow across an entire organization starts looking like one huge database [7]. Whenever a batch, stream, or ETL process transports data from one place and form to another place and form, it is acting like the database subsystem that keeps indexes or materialized views up to date.

从这个角度来看,批处理和流处理器就像触发器、存储过程和物化视图维护例程的精心实现。他们维护的派生数据系统就像不同的索引类型。例如,关系数据库可能支持B树索引、哈希索引、空间索引(参见“多列索引”)和其他类型的索引。在新兴的派生数据系统架构中,这些设施不是作为单个集成数据库产品的功能来实现的,而是由各种不同的软件提供,运行在不同的机器上,由不同的团队管理。

Viewed like this, batch and stream processors are like elaborate implementations of triggers, stored procedures, and materialized view maintenance routines. The derived data systems they maintain are like different index types. For example, a relational database may support B-tree indexes, hash indexes, spatial indexes (see “Multi-column indexes”), and other types of indexes. In the emerging architecture of derived data systems, instead of implementing those facilities as features of a single integrated database product, they are provided by various different pieces of software, running on different machines, administered by different teams.

这些发展未来将把我们带向何方?如果我们从不存在适合所有访问模式的单一数据模型或存储格式的前提出发,我推测有两种途径可以将不同的存储和处理工具组合成一个内聚的系统:

Where will these developments take us in the future? If we start from the premise that there is no single data model or storage format that is suitable for all access patterns, I speculate that there are two avenues by which different storage and processing tools can nevertheless be composed into a cohesive system:

联合数据库:统一读取
Federated databases: unifying reads

可以为各种底层存储引擎和处理方法提供统一的查询接口，这种方法称为联合数据库（federated database）或多存储（polystore）[18, 19]。例如，PostgreSQL 的外部数据包装器功能就符合这种模式 [20]。需要专门数据模型或查询接口的应用程序仍然可以直接访问底层存储引擎，而想要组合来自不同位置的数据的用户则可以通过联合接口轻松实现。

It is possible to provide a unified query interface to a wide variety of underlying storage engines and processing methods—an approach known as a federated database or polystore [18, 19]. For example, PostgreSQL’s foreign data wrapper feature fits this pattern [20]. Applications that need a specialized data model or query interface can still access the underlying storage engines directly, while users who want to combine data from disparate places can do so easily through the federated interface.

联合查询接口遵循单一集成系统的关系传统：具有高级查询语言和优雅的语义，但实现复杂。

A federated query interface follows the relational tradition of a single integrated system with a high-level query language and elegant semantics, but a complicated implementation.

非捆绑数据库:统一写入
Unbundled databases: unifying writes

虽然联合解决了跨多个不同系统的只读查询问题，但对于跨这些系统同步写入，它没有很好的答案。我们说过，在单个数据库中，创建一致的索引是一项内置功能。当我们组合多个存储系统时，我们同样需要确保所有数据更改最终都到达所有正确的位置，即使出现故障也是如此。让存储系统更容易可靠地对接在一起（例如，通过变更数据捕获和事件日志），就像是把数据库的索引维护功能分拆出来，使其能够跨不同技术同步写入 [7, 21]。

While federation addresses read-only querying across several different systems, it does not have a good answer to synchronizing writes across those systems. We said that within a single database, creating a consistent index is a built-in feature. When we compose several storage systems, we similarly need to ensure that all data changes end up in all the right places, even in the face of faults. Making it easier to reliably plug together storage systems (e.g., through change data capture and event logs) is like unbundling a database’s index-maintenance features in a way that can synchronize writes across disparate technologies [7, 21].

非捆绑方法遵循 Unix 的小工具传统：这些小工具各自把一件事做好 [22]，通过统一的低级 API（管道）进行通信，并且可以使用更高级的语言（shell）进行组合 [16]。

The unbundled approach follows the Unix tradition of small tools that do one thing well [22], that communicate through a uniform low-level API (pipes), and that can be composed using a higher-level language (the shell) [16].

进行分拆工作

Making unbundling work

联合和分拆是同一枚硬币的两个方面:用不同的组件组成一个可靠、可扩展且可维护的系统。联合只读查询需要将一种数据模型映射到另一种数据模型,这需要一些思考,但最终是一个相当容易管理的问题。我认为保持对多个存储系统的写入同步是更难的工程问题,因此我将重点关注它。

Federation and unbundling are two sides of the same coin: composing a reliable, scalable, and maintainable system out of diverse components. Federated read-only querying requires mapping one data model into another, which takes some thought but is ultimately quite a manageable problem. I think that keeping the writes to several storage systems in sync is the harder engineering problem, and so I will focus on it.

同步写入的传统方法需要跨异构存储系统的分布式事务[ 18 ],我认为这是错误的解决方案(参见“派生数据与分布式事务”)。单个存储或流处理系统内的事务是可行的,但是当数据跨越不同技术之间的边界时,我相信具有幂等写入的异步事件日志是一种更加健壮和实用的方法。

The traditional approach to synchronizing writes requires distributed transactions across heterogeneous storage systems [18], which I think is the wrong solution (see “Derived data versus distributed transactions”). Transactions within a single storage or stream processing system are feasible, but when data crosses the boundary between different technologies, I believe that an asynchronous event log with idempotent writes is a much more robust and practical approach.

例如,在某些流处理器中使用分布式事务来实现一次性语义(请参阅“原子提交重访”),这可以很好地工作。然而,当事务需要涉及由不同人群编写的系统时(例如,当数据从流处理器写入分布式键值存储或搜索索引时),缺乏标准化事务协议会使集成变得更加困难。具有幂等消费者的有序事件日志(参见“幂等”)是一个更简单的抽象,因此跨异构系统实现更可行[ 7 ]。

For example, distributed transactions are used within some stream processors to achieve exactly-once semantics (see “Atomic commit revisited”), and this can work quite well. However, when a transaction would need to involve systems written by different groups of people (e.g., when data is written from a stream processor to a distributed key-value store or search index), the lack of a standardized transaction protocol makes integration much harder. An ordered log of events with idempotent consumers (see “Idempotence”) is a much simpler abstraction, and thus much more feasible to implement across heterogeneous systems [7].
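A hypothetical sketch of such an idempotent consumer: assuming each event carries a monotonically increasing log offset, the consumer simply remembers the highest offset it has applied, so a redelivered event has no effect:

```python
class SearchIndex:
    """A toy derived data system fed by an ordered event log."""

    def __init__(self):
        self.docs = {}
        self.last_applied = -1  # offset of the last event applied

    def apply(self, offset, doc_id, doc):
        if offset <= self.last_applied:
            return  # duplicate delivery: already applied, safe to skip
        self.docs[doc_id] = doc
        self.last_applied = offset

idx = SearchIndex()
idx.apply(0, "a", "hello")
idx.apply(1, "a", "world")
idx.apply(1, "a", "world")  # redelivered after a crash: no effect
assert idx.docs == {"a": "world"}
assert idx.last_applied == 1
```

This works precisely because the log imposes a total order on events; with an unordered message broker that redelivers messages arbitrarily, the offset comparison would not be meaningful.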

基于日志的集成的一大优点是各个组件之间的松耦合,这体现在两个方面:

The big advantage of log-based integration is loose coupling between the various components, which manifests itself in two ways:

  1. 在系统级别，异步事件流使系统作为一个整体对于单个组件的中断或性能下降更加稳健。如果消费者运行缓慢或失败，事件日志可以缓冲消息（请参阅“磁盘空间使用情况”），从而允许生产者和任何其他消费者继续运行而不受影响。出现故障的消费者可以在修复后赶上进度，因此不会丢失任何数据，故障也得到了遏制。相比之下，分布式事务的同步交互往往会将局部故障升级为大规模故障（参见“分布式事务的局限性”）。

  1. At a system level, asynchronous event streams make the system as a whole more robust to outages or performance degradation of individual components. If a consumer runs slow or fails, the event log can buffer messages (see “Disk space usage”), allowing the producer and any other consumers to continue running unaffected. The faulty consumer can catch up when it is fixed, so it doesn’t miss any data, and the fault is contained. By contrast, the synchronous interaction of distributed transactions tends to escalate local faults into large-scale failures (see “Limitations of distributed transactions”).

  2. 在人的层面上，分拆数据系统允许不同的团队独立地开发、改进和维护不同的软件组件和服务。专业化使每个团队能够专注于做好一件事，并与其他团队的系统有明确定义的接口。事件日志提供了一个足够强大的接口，可以捕获相当强的一致性属性（由于事件的持久性和排序），同时也足够通用，几乎适用于任何类型的数据。

  2. At a human level, unbundling data systems allows different software components and services to be developed, improved, and maintained independently from each other by different teams. Specialization allows each team to focus on doing one thing well, with well-defined interfaces to other teams’ systems. Event logs provide an interface that is powerful enough to capture fairly strong consistency properties (due to durability and ordering of events), but also general enough to be applicable to almost any kind of data.

非捆绑式系统与集成式系统

Unbundled versus integrated systems

如果分拆确实成为未来的方式，它不会取代当前形式的数据库——它们仍然会像以前一样被需要。仍然需要数据库来维护流处理器中的状态，并为批处理和流处理器的输出提供查询服务（请参阅“批处理工作流的输出”和“处理流”）。专用查询引擎对于特定工作负载仍然很重要：例如，MPP 数据仓库中的查询引擎针对探索性分析查询进行了优化，并且可以很好地处理此类工作负载（请参阅“Hadoop 与分布式数据库的比较”）。

If unbundling does indeed become the way of the future, it will not replace databases in their current form—they will still be needed as much as ever. Databases are still required for maintaining state in stream processors, and in order to serve queries for the output of batch and stream processors (see “The Output of Batch Workflows” and “Processing Streams”). Specialized query engines will continue to be important for particular workloads: for example, query engines in MPP data warehouses are optimized for exploratory analytic queries and handle this kind of workload very well (see “Comparing Hadoop to Distributed Databases”).

运行多个不同基础设施的复杂性可能是一个问题：每个软件都有学习曲线、配置问题和运维怪癖，因此值得部署尽可能少的活动部件。与用应用程序代码把多个工具组合起来构成的系统相比，单个集成的软件产品也可能在其针对的工作负载类型上实现更好、更可预测的性能 [23]。正如我在前言中所说，为你不需要的规模而构建是浪费精力，并且可能会把你锁定在不灵活的设计中。实际上，这是一种过早优化。

The complexity of running several different pieces of infrastructure can be a problem: each piece of software has a learning curve, configuration issues, and operational quirks, and so it is worth deploying as few moving parts as possible. A single integrated software product may also be able to achieve better and more predictable performance on the kinds of workloads for which it is designed, compared to a system consisting of several tools that you have composed with application code [23]. As I said in the Preface, building for scale that you don’t need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization.

分拆的目标不是在特定工作负载的性能上与单个数据库竞争；其目标是允许您组合多个不同的数据库，从而在比单一软件所能覆盖的更广泛的工作负载范围内获得良好的性能。它关乎广度，而不是深度——与我们在“Hadoop 与分布式数据库的比较”中讨论的存储和处理模型的多样性一脉相承。

The goal of unbundling is not to compete with individual databases on performance for particular workloads; the goal is to allow you to combine several different databases in order to achieve good performance for a much wider range of workloads than is possible with a single piece of software. It’s about breadth, not depth—in the same vein as the diversity of storage and processing models that we discussed in “Comparing Hadoop to Distributed Databases”.

因此,如果有一种技术可以满足您所需的一切,那么您很可能最好只是使用该产品,而不是尝试自己从较低级别的组件中重新实现它。只有当没有单一软件可以满足您的所有要求时,分拆和组合的优势才会显现出来。

Thus, if there is a single technology that does everything you need, you’re most likely best off simply using that product rather than trying to reimplement it yourself from lower-level components. The advantages of unbundling and composition only come into the picture when there is no single piece of software that satisfies all your requirements.

少了什么东西?

What’s missing?

构建数据系统的工具正在变得越来越好，但我认为还缺少一个主要部分：我们还没有与 Unix shell 相当的非捆绑数据库工具（即一种以简单且声明式的方式组合存储和处理系统的高级语言）。

The tools for composing data systems are getting better, but I think one major part is missing: we don’t yet have the unbundled-database equivalent of the Unix shell (i.e., a high-level language for composing storage and processing systems in a simple and declarative way).

例如，如果我们可以简单地声明 mysql | elasticsearch（类似于 Unix 管道 [22]），我会很高兴：这将是 CREATE INDEX 的非捆绑等价物，它会获取 MySQL 数据库中的所有文档，并在 Elasticsearch 集群中为它们建立索引。然后，它会持续捕获对数据库所做的所有更改，并自动将它们应用到搜索索引，而无需我们编写自定义应用程序代码。几乎任何类型的存储或索引系统都应该可以进行这种集成。

For example, I would love it if we could simply declare mysql | elasticsearch, by analogy to Unix pipes [22], which would be the unbundled equivalent of CREATE INDEX: it would take all the documents in a MySQL database and index them in an Elasticsearch cluster. It would then continually capture all the changes made to the database and automatically apply them to the search index, without us having to write custom application code. This kind of integration should be possible with almost any kind of storage or indexing system.
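As a toy illustration of what such a hypothetical mysql | elasticsearch operator would have to do internally (everything here is made up for the sketch; in-memory dicts stand in for the real systems): take a consistent snapshot, index it, then apply the change log in order, treating every change as an idempotent upsert or delete:

```python
def sync(snapshot, changelog, search_index):
    """Continuously derive a search index from a source table.

    snapshot: row_id -> row, a consistent snapshot of the source table.
    changelog: ordered (kind, row_id, row) changes captured from the
    source, starting at the snapshot's position.
    """
    # 1. Initial load: index the snapshot.
    for row_id, row in snapshot.items():
        search_index[row_id] = row
    # 2. Catch-up and ongoing maintenance: apply each change in log
    #    order. Upserts and deletes are idempotent, so reapplying a
    #    change after a crash is harmless.
    for kind, row_id, row in changelog:
        if kind == "delete":
            search_index.pop(row_id, None)
        else:  # "insert" or "update"
            search_index[row_id] = row

search_index = {}
snapshot = {1: "first doc", 2: "second doc"}
changelog = [("update", 1, "revised doc"), ("delete", 2, None)]
sync(snapshot, changelog, search_index)
assert search_index == {1: "revised doc"}
```

In a real system the snapshot and changelog would come from change data capture on the source database, and the loop over the changelog would never terminate.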

同样,如果能够更轻松地预计算和更新缓存,那就太好了。回想一下,物化视图本质上是一个预先计算的缓存,因此您可以想象通过以声明方式为复杂查询指定物化视图来创建缓存,包括图上的递归查询(请参阅“类图数据模型”)和应用程序逻辑。该领域有一些有趣的早期研究,例如差分数据流 [ 24 , 25 ],我希望这些想法能够进入生产系统。

Similarly, it would be great to be able to precompute and update caches more easily. Recall that a materialized view is essentially a precomputed cache, so you could imagine creating a cache by declaratively specifying materialized views for complex queries, including recursive queries on graphs (see “Graph-Like Data Models”) and application logic. There is interesting early-stage research in this area, such as differential dataflow [24, 25], and I hope that these ideas will find their way into production systems.

围绕数据流设计应用程序

Designing Applications Around Dataflow

通过将专门的存储和处理系统与应用程序代码组合起来来分拆数据库的方法，也被称为“数据库由内而外”方法 [26]，这个名字来自我在 2014 年发表的一次会议演讲的标题 [27]。然而，称其为“新架构”未免太过宏大。我更多地将其视为一种设计模式，一个讨论的起点；我们给它起一个名字，只是为了能更好地讨论它。

The approach of unbundling databases by composing specialized storage and processing systems with application code is also becoming known as the “database inside-out” approach [26], after the title of a conference talk I gave in 2014 [27]. However, calling it a “new architecture” is too grandiose. I see it more as a design pattern, a starting point for discussion, and we give it a name simply so that we can better talk about it.

这些想法不是我的；它们只是其他人想法的融合，我认为我们应该从中学习。特别是，它与数据流语言（例如 Oz [28] 和 Juttle [29]）、函数响应式编程（FRP）语言（例如 Elm [30, 31]）以及逻辑编程语言（例如 Bloom [32]）有很多重叠。在这个语境下，“分拆”一词是由 Jay Kreps 提出的 [7]。

These ideas are not mine; they are simply an amalgamation of other people’s ideas from which I think we should learn. In particular, there is a lot of overlap with dataflow languages such as Oz [28] and Juttle [29], functional reactive programming (FRP) languages such as Elm [30, 31], and logic programming languages such as Bloom [32]. The term unbundling in this context was proposed by Jay Kreps [7].

甚至电子表格也具有远远领先于大多数主流编程语言的数据流编程能力[ 33 ]。在电子表格中,您可以将公式放入一个单元格中(例如,另一列中的单元格总和),每当公式的任何输入发生更改时,都会自动重新计算公式的结果。这正是我们在数据系统级别想要的:当数据库中的记录发生更改时,我们希望该记录的任何索引都能自动更新,并且依赖于该记录的任何缓存视图或聚合都能够自动刷新。您不必担心刷新如何发生的技术细节,而只需相信它可以正常工作。

Even spreadsheets have dataflow programming capabilities that are miles ahead of most mainstream programming languages [33]. In a spreadsheet, you can put a formula in one cell (for example, the sum of cells in another column), and whenever any input to the formula changes, the result of the formula is automatically recalculated. This is exactly what we want at a data system level: when a record in a database changes, we want any index for that record to be automatically updated, and any cached views or aggregations that depend on the record to be automatically refreshed. You should not have to worry about the technical details of how this refresh happens, but be able to simply trust that it works correctly.
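The spreadsheet analogy can be made concrete with a toy sketch (an illustration only, not a real dataflow engine): a derived cell re-evaluates its formula whenever one of its input cells changes, just as a derived dataset should be refreshed when its source data changes:

```python
class Cell:
    """A mutable input value that notifies its dependents on change."""

    def __init__(self, value):
        self.value = value
        self.dependents = []

    def set(self, value):
        self.value = value
        for dependent in self.dependents:
            dependent.recompute()  # push the change downstream

class DerivedCell:
    """A value computed from other cells by a formula."""

    def __init__(self, inputs, formula):
        self.inputs = inputs
        self.formula = formula
        for cell in inputs:
            cell.dependents.append(self)
        self.recompute()

    def recompute(self):
        self.value = self.formula(*(c.value for c in self.inputs))

a, b = Cell(1), Cell(2)
total = DerivedCell([a, b], lambda x, y: x + y)
assert total.value == 3
a.set(10)  # changing an input automatically refreshes the derived value
assert total.value == 12
```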

因此,我认为大多数数据系统仍然需要从 VisiCalc 1979 年已经拥有的功能中学习一些东西 [ 34 ]。与电子表格的不同之处在于,当今的数据系统需要具有容错性、可扩展性并且能够持久存储数据。他们还需要能够随着时间的推移集成不同人群编写的不同技术,并重用现有的库和服务:期望所有软件都使用一种特定的语言、框架或工具开发是不现实的。

Thus, I think that most data systems still have something to learn from the features that VisiCalc already had in 1979 [34]. The difference from spreadsheets is that today’s data systems need to be fault-tolerant, scalable, and store data durably. They also need to be able to integrate disparate technologies written by different groups of people over time, and reuse existing libraries and services: it is unrealistic to expect all software to be developed using one particular language, framework, or tool.

在本节中,我将扩展这些想法,并探索围绕非捆绑数据库和数据流的想法构建应用程序的一些方法。

In this section I will expand on these ideas and explore some ways of building applications around the ideas of unbundled databases and dataflow.

作为导出函数的应用程序代码

Application code as a derivation function

当一个数据集派生自另一个数据集时,它会经历某种转换函数。例如:

When one dataset is derived from another, it goes through some kind of transformation function. For example:

  • 二级索引是一种具有简单转换功能的派生数据集:对于基表中的每一行或文档,它挑选出正在索引的列或字段中的值,并按这些值进行排序(假设是 B 树或SSTable 索引,按键排序,如第 3 章所述)。

  • A secondary index is a kind of derived dataset with a straightforward transformation function: for each row or document in the base table, it picks out the values in the columns or fields being indexed, and sorts by those values (assuming a B-tree or SSTable index, which are sorted by key, as discussed in Chapter 3).

  • 全文搜索索引是通过应用各种自然语言处理功能(例如语言检测、分词、词干或词形还原、拼写纠正和同义词识别)创建的,然后构建用于高效查找的数据结构(例如倒排索引) 。

  • A full-text search index is created by applying various natural language processing functions such as language detection, word segmentation, stemming or lemmatization, spelling correction, and synonym identification, followed by building a data structure for efficient lookups (such as an inverted index).

  • 在机器学习系统中,我们可以将模型视为通过应用各种特征提取和统计分析函数从训练数据中导出的模型。当模型应用于新的输入数据时,模型的输出来自输入和模型(因此间接来自训练数据)。

  • In a machine learning system, we can consider the model as being derived from the training data by applying various feature extraction and statistical analysis functions. When the model is applied to new input data, the output of the model is derived from the input and the model (and hence, indirectly, from the training data).

  • 缓存通常包含数据的聚合,其形式将在用户界面 (UI) 中显示。因此,填充缓存需要了解 UI 中引用了哪些字段;UI 中的更改可能需要更新缓存填充方式的定义并重建缓存。

  • A cache often contains an aggregation of data in the form in which it is going to be displayed in a user interface (UI). Populating the cache thus requires knowledge of what fields are referenced in the UI; changes in the UI may require updating the definition of how the cache is populated and rebuilding the cache.

二级索引的派生函数非常常用，因此它作为核心功能内置到许多数据库中，您只需执行 CREATE INDEX 即可调用它。对于全文索引，常见语言的基本语言学功能可以内置到数据库中，但更复杂的功能通常需要针对特定领域的调整。在机器学习中，特征工程是出了名的特定于应用程序，并且通常必须结合有关用户交互和应用程序部署的详细知识 [35]。

The derivation function for a secondary index is so commonly required that it is built into many databases as a core feature, and you can invoke it by merely saying CREATE INDEX. For full-text indexing, basic linguistic features for common languages may be built into a database, but the more sophisticated features often require domain-specific tuning. In machine learning, feature engineering is notoriously application-specific, and often has to incorporate detailed knowledge about the user interaction and deployment of an application [35].

当创建派生数据集的函数不是像创建二级索引那样的标准化函数时，就需要自定义代码来处理特定于应用程序的方面。而这些自定义代码正是许多数据库的难点所在。尽管关系数据库通常支持触发器、存储过程和用户定义函数（可用于在数据库中执行应用程序代码），但它们在数据库设计中多少是事后才添加的（请参阅“传输事件流”）。

When the function that creates a derived dataset is not a standard cookie-cutter function like creating a secondary index, custom code is required to handle the application-specific aspects. And this custom code is where many databases struggle. Although relational databases commonly support triggers, stored procedures, and user-defined functions, which can be used to execute application code within the database, they have been somewhat of an afterthought in database design (see “Transmitting Event Streams”).

应用程序代码和状态分离

Separation of application code and state

理论上,数据库可以是任意应用程序代码的部署环境,就像操作系统一样。然而,在实践中,它们并不适合这个目的。它们不太适合现代应用程序开发的要求,例如依赖项和包管理、版本控制、滚动升级、可演化性、监控、指标、网络服务调用以及与外部系统的集成。

In theory, databases could be deployment environments for arbitrary application code, like an operating system. However, in practice they have turned out to be poorly suited for this purpose. They do not fit well with the requirements of modern application development, such as dependency and package management, version control, rolling upgrades, evolvability, monitoring, metrics, calls to network services, and integration with external systems.

另一方面,Mesos、YARN、Docker、Kubernetes 等部署和集群管理工具是专门为运行应用程序代码而设计的。通过专注于做好一件事,他们能够比数据库做得更好,数据库提供用户定义函数的执行作为其众多功能之一。

On the other hand, deployment and cluster management tools such as Mesos, YARN, Docker, Kubernetes, and others are designed specifically for the purpose of running application code. By focusing on doing one thing well, they are able to do it much better than a database that provides execution of user-defined functions as one of its many features.

我认为系统的某些部分专门用于持久数据存储,而其他部分专门用于运行应用程序代码是有意义的。两者可以互动,同时仍保持独立。

I think it makes sense to have some parts of a system that specialize in durable data storage, and other parts that specialize in running application code. The two can interact while still remaining independent.

如今,大多数 Web 应用程序都部署为无状态服务,其中任何用户请求都可以路由到任何应用程序服务器,并且服务器在发送响应后就会忘记有关请求的所有内容。这种部署方式很方便,因为可以随意添加或删除服务器,但状态必须转移到某个地方:通常是数据库。趋势是将无状态应用程序逻辑与状态管理(数据库)分开:不将应用程序逻辑放入数据库中,也不将持久状态放入应用程序中[ 36 ]。正如函数式编程社区的人们喜欢开玩笑一样,“我们相信教会与国家的分离”[ 37 ]。

Most web applications today are deployed as stateless services, in which any user request can be routed to any application server, and the server forgets everything about the request once it has sent the response. This style of deployment is convenient, as servers can be added or removed at will, but the state has to go somewhere: typically, a database. The trend has been to keep stateless application logic separate from state management (databases): not putting application logic in the database and not putting persistent state in the application [36]. As people in the functional programming community like to joke, “We believe in the separation of Church and state” [37].

在这个典型的 Web 应用程序模型中,数据库充当一种可变共享变量,可以通过网络同步访问。应用程序可以读取和更新变量,数据库负责使其持久化,并提供一些并发控制和容错能力。

In this typical web application model, the database acts as a kind of mutable shared variable that can be accessed synchronously over the network. The application can read and update the variable, and the database takes care of making it durable, providing some concurrency control and fault tolerance.

但是,在大多数编程语言中,您无法订阅可变变量中的更改 - 您只能定期读取它。与电子表格不同的是,如果变量的值发生变化,变量的读取者不会收到通知。(您可以在自己的代码中实现此类通知 - 这称为观察者模式- 但大多数语言没有此模式作为内置功能。)

However, in most programming languages you cannot subscribe to changes in a mutable variable—you can only read it periodically. Unlike in a spreadsheet, readers of the variable don’t get notified if the value of the variable changes. (You can implement such notifications in your own code—this is known as the observer pattern—but most languages do not have this pattern as a built-in feature.)
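The observer pattern just mentioned can be sketched in a few lines (a minimal illustration, not any particular framework's API): a mutable variable that notifies subscribers whenever its value changes, which is exactly what plain variables and database polling do not give you:

```python
class ObservableVariable:
    """A mutable variable whose readers can subscribe to changes."""

    def __init__(self, value):
        self._value = value
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    def set(self, new_value):
        self._value = new_value
        for callback in self._observers:
            callback(new_value)  # push the change to every subscriber

seen = []
rate = ObservableVariable(1.05)
rate.subscribe(seen.append)
rate.set(1.10)
rate.set(1.12)
assert seen == [1.10, 1.12]  # subscribers saw every change, in order
```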

数据库继承了这种对可变数据的被动方法:如果您想查明数据库的内容是否已更改,通常您唯一的选择是轮询(即定期重复查询)。订阅变更作为一项功能才刚刚开始出现(请参阅 “变更流的 API 支持”)。

Databases have inherited this passive approach to mutable data: if you want to find out whether the content of the database has changed, often your only option is to poll (i.e., to repeat your query periodically). Subscribing to changes is only just beginning to emerge as a feature (see “API support for change streams”).

数据流:状态更改和应用程序代码之间的相互作用

Dataflow: Interplay between state changes and application code

从数据流的角度考虑应用程序意味着重新协商应用程序代码和状态管理之间的关系。我们没有将数据库视为由应用程序操纵的被动变量,而是更多地考虑状态、状态更改和处理它们的代码之间的相互作用和协作。应用程序代码通过触发另一处的状态更改来响应一处的状态更改。

Thinking about applications in terms of dataflow implies renegotiating the relationship between application code and state management. Instead of treating a database as a passive variable that is manipulated by the application, we think much more about the interplay and collaboration between state, state changes, and code that processes them. Application code responds to state changes in one place by triggering state changes in another place.

我们在“数据库和流”中看到了这种思路，其中我们讨论了将数据库的变更日志视为可以订阅的事件流。诸如 Actor 之类的消息传递系统（参见“消息传递数据流”）也有响应事件的概念。早在 20 世纪 80 年代，元组空间模型就探索了用观察状态变化并对其做出反应的进程来表达分布式计算 [38, 39]。

We saw this line of thinking in “Databases and Streams”, where we discussed treating the log of changes to a database as a stream of events that we can subscribe to. Message-passing systems such as actors (see “Message-Passing Dataflow”) also have this concept of responding to events. Already in the 1980s, the tuple spaces model explored expressing distributed computations in terms of processes that observe state changes and react to them [38, 39].

正如所讨论的,当触发器由于数据更改而触发时,或者当更新辅助索引以反映正在索引的表中的更改时,数据库内部会发生类似的情况。分拆数据库意味着采用这一想法并将其应用于主数据库之外的派生数据集的创建:缓存、全文搜索索引、机器学习或分析系统。为此,我们可以使用流处理和消息传递系统。

As discussed, similar things happen inside a database when a trigger fires due to a data change, or when a secondary index is updated to reflect a change in the table being indexed. Unbundling the database means taking this idea and applying it to the creation of derived datasets outside of the primary database: caches, full-text search indexes, machine learning, or analytics systems. We can use stream processing and messaging systems for this purpose.

要记住的重要一点是：维护派生数据不同于异步作业执行，而消息传递系统传统上是为后者设计的（请参阅“日志与传统消息传递的比较”）：

The important thing to keep in mind is that maintaining derived data is not the same as asynchronous job execution, for which messaging systems are traditionally designed (see “Logs compared to traditional messaging”):

  • 在维护派生数据时,状态更改的顺序通常很重要(如果从事件日志派生多个视图,则它们需要以相同的顺序处理事件,以便它们保持彼此一致)。正如“确认和重新传递”中所讨论的,许多消息代理在重新传递未确认的消息时不具有此属性。双重写入也被排除(参见“保持系统同步”)。

  • When maintaining derived data, the order of state changes is often important (if several views are derived from an event log, they need to process the events in the same order so that they remain consistent with each other). As discussed in “Acknowledgments and redelivery”, many message brokers do not have this property when redelivering unacknowledged messages. Dual writes are also ruled out (see “Keeping Systems in Sync”).

  • 容错能力是派生数据的关键:仅丢失一条消息就会导致派生数据集与其数据源永久不同步。消息传递和派生状态更新都必须可靠。例如,许多 Actor 系统默认在内存中维护 Actor 状态和消息,因此如果运行 Actor 的机器崩溃,它们就会丢失。

  • Fault tolerance is key for derived data: losing just a single message causes the derived dataset to go permanently out of sync with its data source. Both message delivery and derived state updates must be reliable. For example, many actor systems by default maintain actor state and messages in memory, so they are lost if the machine running the actor crashes.

稳定的消息排序和容错消息处理是相当严格的要求,但它们比分布式事务便宜得多并且操作上更健壮。现代流处理器可以大规模提供这些排序和可靠性保证,并且它们允许应用程序代码作为流运算符运行。

Stable message ordering and fault-tolerant message processing are quite stringent demands, but they are much less expensive and more operationally robust than distributed transactions. Modern stream processors can provide these ordering and reliability guarantees at scale, and they allow application code to be run as stream operators.

该应用程序代码可以执行数据库中内置推导函数通常不提供的任意处理。就像通过管道链接的 Unix 工具一样,流运算符可以围绕数据流构建大型系统。每个运算符将状态变化流作为输入,并产生其他状态变化流作为输出。

This application code can do the arbitrary processing that built-in derivation functions in databases generally don’t provide. Like Unix tools chained by pipes, stream operators can be composed to build large systems around dataflow. Each operator takes streams of state changes as input, and produces other streams of state changes as output.
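As a sketch of this composition style, assuming Python generators as stand-ins for stream operators: each operator consumes a stream of events and yields another stream, and they chain together like a Unix pipeline:

```python
def parse(lines):
    """Operator 1: turn raw log lines into structured events."""
    for line in lines:
        user, amount = line.split(",")
        yield {"user": user, "amount": int(amount)}

def only_large(events, threshold):
    """Operator 2: filter, passing through events above a threshold."""
    for event in events:
        if event["amount"] >= threshold:
            yield event

def running_total(events):
    """Operator 3: stateful aggregation, emitting a running sum."""
    total = 0
    for event in events:
        total += event["amount"]
        yield total

raw = ["alice,5", "bob,50", "carol,70"]
# Compose the operators, analogous to: cat raw | parse | filter | sum
totals = list(running_total(only_large(parse(raw), 10)))
assert totals == [50, 120]
```

Unlike these in-process generators, real stream operators run continuously across machines and must also handle ordering and fault tolerance, but the compositional shape is the same.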

流处理器和服务

Stream processors and services

当前流行的应用程序开发风格涉及将功能分解为一组服务,这些服务通过同步网络请求(例如 REST API)进行通信(请参阅 “通过服务的数据流:REST 和 RPC”)。与单个整体应用程序相比,这种面向服务的架构的优势主要在于通过松散耦合实现组织可扩展性:不同的团队可以处理不同的服务,这减少了团队之间的协调工作(只要服务可以独立部署和更新)。

The currently trendy style of application development involves breaking down functionality into a set of services that communicate via synchronous network requests such as REST APIs (see “Dataflow Through Services: REST and RPC”). The advantage of such a service-oriented architecture over a single monolithic application is primarily organizational scalability through loose coupling: different teams can work on different services, which reduces coordination effort between teams (as long as the services can be deployed and updated independently).

Composing stream operators into dataflow systems has a lot of similar characteristics to the microservices approach [40]. However, the underlying communication mechanism is very different: one-directional, asynchronous message streams rather than synchronous request/response interactions.

Besides the advantages listed in “Message-Passing Dataflow”, such as better fault tolerance, dataflow systems can also achieve better performance. For example, say a customer is purchasing an item that is priced in one currency but paid for in another currency. In order to perform the currency conversion, you need to know the current exchange rate. This operation could be implemented in two ways [40, 41]:

  1. In the microservices approach, the code that processes the purchase would probably query an exchange-rate service or database in order to obtain the current rate for a particular currency.

  2. In the dataflow approach, the code that processes purchases would subscribe to a stream of exchange rate updates ahead of time, and record the current rate in a local database whenever it changes. When it comes to processing the purchase, it only needs to query the local database.

The second approach has replaced a synchronous network request to another service with a query to a local database (which may be on the same machine, even in the same process). Not only is the dataflow approach faster, but it is also more robust to the failure of another service. The fastest and most reliable network request is no network request at all! Instead of RPC, we now have a stream join between purchase events and exchange rate update events (see “Stream-table join (stream enrichment)”).
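
To make this concrete, here is a minimal sketch of the dataflow approach, with in-process dictionaries standing in for the local database and event streams (the event field names are illustrative, not from the text):

```python
# Local replica of the exchange-rate table, maintained by the write path.
local_rates = {}

def on_rate_update(event):
    """Write path: a rate-update event keeps the local table current."""
    local_rates[event["currency"]] = event["rate"]

def on_purchase(event):
    """Stream-table join: enrich a purchase with the locally cached rate,
    with no synchronous network request to a rate service."""
    rate = local_rates[event["currency"]]
    return {**event, "amount_usd": event["amount"] * rate}

# Rate updates arrive ahead of time...
on_rate_update({"currency": "EUR", "rate": 1.08})
# ...so processing a purchase is only a local lookup.
converted = on_purchase({"item": "book", "amount": 10.0, "currency": "EUR"})
```

The purchase handler never blocks on another service; the join happens against state that the rate stream has already materialized locally.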

The join is time-dependent: if the purchase events are reprocessed at a later point in time, the exchange rate will have changed. If you want to reconstruct the original output, you will need to obtain the historical exchange rate at the original time of purchase. No matter whether you query a service or subscribe to a stream of exchange rate updates, you will need to handle this time dependence (see “Time-dependence of joins”).

Subscribing to a stream of changes, rather than querying the current state when needed, brings us closer to a spreadsheet-like model of computation: when some piece of data changes, any derived data that depends on it can swiftly be updated. There are still many open questions, for example around issues like time-dependent joins, but I believe that building applications around dataflow ideas is a very promising direction to go in.

Observing Derived State

At an abstract level, the dataflow systems discussed in the last section give you a process for creating derived datasets (such as search indexes, materialized views, and predictive models) and keeping them up to date. Let’s call that process the write path: whenever some piece of information is written to the system, it may go through multiple stages of batch and stream processing, and eventually every derived dataset is updated to incorporate the data that was written. Figure 12-1 shows an example of updating a search index.

Figure 12-1. In a search index, writes (document updates) meet reads (queries).

But why do you create the derived dataset in the first place? Most likely because you want to query it again at a later time. This is the read path: when serving a user request you read from the derived dataset, perhaps perform some more processing on the results, and construct the response to the user.

Taken together, the write path and the read path encompass the whole journey of the data, from the point where it is collected to the point where it is consumed (probably by another human). The write path is the portion of the journey that is precomputed—i.e., that is done eagerly as soon as the data comes in, regardless of whether anyone has asked to see it. The read path is the portion of the journey that only happens when someone asks for it. If you are familiar with functional programming languages, you might notice that the write path is similar to eager evaluation, and the read path is similar to lazy evaluation.

The derived dataset is the place where the write path and the read path meet, as illustrated in Figure 12-1. It represents a trade-off between the amount of work that needs to be done at write time and the amount that needs to be done at read time.

Materialized views and caching

A full-text search index is a good example: the write path updates the index, and the read path searches the index for keywords. Both reads and writes need to do some work. Writes need to update the index entries for all terms that appear in the document. Reads need to search for each of the words in the query, and apply Boolean logic to find documents that contain all of the words in the query (an AND operator), or any synonym of each of the words (an OR operator).
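
A toy inverted index illustrates the split: the write path updates postings for every term in a document, and the read path combines postings with Boolean logic (the function names here are illustrative):

```python
# term -> set of document ids containing that term
index = {}

def write(doc_id, text):
    """Write path: update the index entry for every term in the document."""
    for term in set(text.lower().split()):
        index.setdefault(term, set()).add(doc_id)

def search_and(*terms):
    """AND query: documents containing all of the terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def search_or(*terms):
    """OR query: documents containing any of the terms."""
    return set().union(*(index.get(t, set()) for t in terms))

write(1, "stream processing with dataflow")
write(2, "batch processing of logs")
```

Both paths do real work: writes touch one posting list per term, reads intersect or union the relevant lists.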

If you didn’t have an index, a search query would have to scan over all documents (like grep), which would get very expensive if you had a large number of documents. No index means less work on the write path (no index to update), but a lot more work on the read path.

On the other hand, you could imagine precomputing the search results for all possible queries. In that case, you would have less work to do on the read path: no Boolean logic, just find the results for your query and return them. However, the write path would be a lot more expensive: the set of possible search queries that could be asked is infinite, and thus precomputing all possible search results would require infinite time and storage space. That wouldn’t work so well.

Another option would be to precompute the search results for only a fixed set of the most common queries, so that they can be served quickly without having to go to the index. The uncommon queries can still be served from the index. This would generally be called a cache of common queries, although we could also call it a materialized view, as it would need to be updated when new documents appear that should be included in the results of one of the common queries.
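
A sketch of such a common-query cache maintained as a materialized view (here the view is refreshed by re-running the query on the write path; a real system might patch it incrementally):

```python
COMMON_QUERIES = ["news"]   # the fixed set of queries kept materialized
documents = {}              # doc_id -> text
cache = {}                  # query -> precomputed result set

def run_query(q):
    """Fallback read path: scan all documents (like grep)."""
    return {d for d, text in documents.items() if q in text}

def on_new_document(doc_id, text):
    documents[doc_id] = text
    # Extra work on the write path: refresh any affected materialized view.
    for q in COMMON_QUERIES:
        if q in text:
            cache[q] = run_query(q)

def read(q):
    # Common queries are served from the view; others fall back to a scan.
    return cache.get(q) or run_query(q)

on_new_document(1, "tech news today")
on_new_document(2, "sports results")
```

The cache is updated when a matching document arrives, which is exactly what distinguishes a materialized view from a passive cache that merely expires.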

From this example we can see that an index is not the only possible boundary between the write path and the read path. Caching of common search results is possible, and grep-like scanning without the index is also possible on a small number of documents. Viewed like this, the role of caches, indexes, and materialized views is simple: they shift the boundary between the read path and the write path. They allow us to do more work on the write path, by precomputing results, in order to save effort on the read path.

Shifting the boundary between work done on the write path and the read path was in fact the topic of the Twitter example at the beginning of this book, in “Describing Load”. In that example, we also saw how the boundary between write path and read path might be drawn differently for celebrities compared to ordinary users. After 500 pages we have come full circle!

Stateful, offline-capable clients

I find the idea of a boundary between write and read paths interesting because we can discuss shifting that boundary and explore what that shift means in practical terms. Let’s look at the idea in a different context.

The huge popularity of web applications in the last two decades has led us to certain assumptions about application development that are easy to take for granted. In particular, the client/server model—in which clients are largely stateless and servers have the authority over data—is so common that we almost forget that anything else exists. However, technology keeps moving on, and I think it is important to question the status quo from time to time.

Traditionally, web browsers have been stateless clients that can only do useful things when you have an internet connection (just about the only thing you could do offline was to scroll up and down in a page that you had previously loaded while online). However, recent “single-page” JavaScript web apps have gained a lot of stateful capabilities, including client-side user interface interaction and persistent local storage in the web browser. Mobile apps can similarly store a lot of state on the device and don’t require a round-trip to the server for most user interactions.

These changing capabilities have led to a renewed interest in offline-first applications that do as much as possible using a local database on the same device, without requiring an internet connection, and sync with remote servers in the background when a network connection is available [42]. Since mobile devices often have slow and unreliable cellular internet connections, it’s a big advantage for users if their user interface does not have to wait for synchronous network requests, and if apps mostly work offline (see “Clients with offline operation”).

When we move away from the assumption of stateless clients talking to a central database and toward state that is maintained on end-user devices, a world of new opportunities opens up. In particular, we can think of the on-device state as a cache of state on the server. The pixels on the screen are a materialized view onto model objects in the client app; the model objects are a local replica of state in a remote datacenter [27].

Pushing state changes to clients

In a typical web page, if you load the page in a web browser and the data subsequently changes on the server, the browser does not find out about the change until you reload the page. The browser only reads the data at one point in time, assuming that it is static—it does not subscribe to updates from the server. Thus, the state on the device is a stale cache that is not updated unless you explicitly poll for changes. (HTTP-based feed subscription protocols like RSS are really just a basic form of polling.)

More recent protocols have moved beyond the basic request/response pattern of HTTP: server-sent events (the EventSource API) and WebSockets provide communication channels by which a web browser can keep an open TCP connection to a server, and the server can actively push messages to the browser as long as it remains connected. This provides an opportunity for the server to actively inform the end-user client about any changes to the state it has stored locally, reducing the staleness of the client-side state.
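
The shape of push-based propagation can be sketched in a few lines, with in-process queues standing in for WebSocket or EventSource connections (this is an assumption-laden toy, not a real transport):

```python
from queue import Queue

subscribers = []  # one channel per connected client

def connect():
    """A client opens a connection and receives a channel for pushed changes."""
    q = Queue()
    subscribers.append(q)
    return q

def apply_change(change):
    """The write path extended to clients: every state change is pushed out
    to all connected subscribers, instead of waiting to be polled."""
    for q in subscribers:
        q.put(change)

client = connect()
apply_change({"key": "status", "value": "shipped"})
pushed = client.get_nowait()  # the client sees the change without reloading
```

The key inversion is that the server calls the client, so the client-side state goes stale only while the connection is down.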

In terms of our model of write path and read path, actively pushing state changes all the way to client devices means extending the write path all the way to the end user. When a client is first initialized, it would still need to use a read path to get its initial state, but thereafter it could rely on a stream of state changes sent by the server. The ideas we discussed around stream processing and messaging are not restricted to running only in a datacenter: we can take the ideas further, and extend them all the way to end-user devices [43].

The devices will be offline some of the time, and unable to receive any notifications of state changes from the server during that time. But we already solved that problem: in “Consumer offsets” we discussed how a consumer of a log-based message broker can reconnect after failing or becoming disconnected, and ensure that it doesn’t miss any messages that arrived while it was disconnected. The same technique works for individual users, where each device is a small subscriber to a small stream of events.
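
Offset-based resumption is simple to sketch: the broker keeps an ordered log, each device remembers how far it has read, and on reconnect it fetches everything past its last offset (a list stands in for the broker's log here):

```python
log = []  # the broker's append-only event log

def publish(event):
    log.append(event)

def catch_up(offset):
    """Return every event the consumer has not yet seen, plus its new offset."""
    missed = log[offset:]
    return missed, len(log)

publish("change-1")
offset = 0
seen, offset = catch_up(offset)    # device online: sees change-1
publish("change-2")                # device goes offline; changes accumulate
publish("change-3")
missed, offset = catch_up(offset)  # on reconnect, nothing was lost
```

Because the log retains messages independently of any consumer, a disconnected device is just a consumer whose offset lags behind.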

End-to-end event streams

Recent tools for developing stateful clients and user interfaces, such as the Elm language [30] and Facebook’s toolchain of React, Flux, and Redux [44], already manage internal client-side state by subscribing to a stream of events representing user input or responses from a server, structured similarly to event sourcing (see “Event Sourcing”).

It would be very natural to extend this programming model to also allow a server to push state-change events into this client-side event pipeline. Thus, state changes could flow through an end-to-end write path: from the interaction on one device that triggers a state change, via event logs and through several derived data systems and stream processors, all the way to the user interface of a person observing the state on another device. These state changes could be propagated with fairly low delay—say, under one second end to end.

Some applications, such as instant messaging and online games, already have such a “real-time” architecture (in the sense of interactions with low delay, not in the sense of “Response time guarantees”). But why don’t we build all applications this way?

The challenge is that the assumption of stateless clients and request/response interactions is very deeply ingrained in our databases, libraries, frameworks, and protocols. Many datastores support read and write operations where a request returns one response, but much fewer provide an ability to subscribe to changes—i.e., a request that returns a stream of responses over time (see “API support for change streams”).

In order to extend the write path all the way to the end user, we would need to fundamentally rethink the way we build many of these systems: moving away from request/response interaction and toward publish/subscribe dataflow [27]. I think that the advantages of more responsive user interfaces and better offline support would make it worth the effort. If you are designing data systems, I hope that you will keep in mind the option of subscribing to changes, not just querying the current state.

Reads are events too

We discussed that when a stream processor writes derived data to a store (database, cache, or index), and when user requests query that store, the store acts as the boundary between the write path and the read path. The store allows random-access read queries to the data that would otherwise require scanning the whole event log.

In many cases, the data storage is separate from the streaming system. But recall that stream processors also need to maintain state to perform aggregations and joins (see “Stream Joins”). This state is normally hidden inside the stream processor, but some frameworks allow it to also be queried by outside clients [45], turning the stream processor itself into a kind of simple database.

I would like to take that idea further. As discussed so far, the writes to the store go through an event log, while reads are transient network requests that go directly to the nodes that store the data being queried. This is a reasonable design, but not the only possible one. It is also possible to represent read requests as streams of events, and send both the read events and the write events through a stream processor; the processor responds to read events by emitting the result of the read to an output stream [46].

When both the writes and the reads are represented as events, and routed to the same stream operator in order to be handled, we are in fact performing a stream-table join between the stream of read queries and the database. The read event needs to be sent to the database partition holding the data (see “Request Routing”), just like batch and stream processors need to copartition inputs on the same key when joining (see “Reduce-Side Joins and Grouping”).
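
A minimal sketch of this idea, with dictionaries standing in for partitions and a list standing in for the output stream (the event schema is invented for illustration):

```python
N_PARTITIONS = 2
partitions = [{} for _ in range(N_PARTITIONS)]  # per-partition key/value state
output_stream = []                              # responses to read events

def handle(event):
    # Reads and writes are copartitioned: both are routed by the same key,
    # just like join inputs must be partitioned on the same key.
    part = partitions[hash(event["key"]) % N_PARTITIONS]
    if event["type"] == "write":
        part[event["key"]] = event["value"]
    else:  # a read event: emit the result onto the output stream
        output_stream.append({"key": event["key"],
                              "value": part.get(event["key"])})

handle({"type": "write", "key": "user:1", "value": "alice"})
handle({"type": "read", "key": "user:1"})
```

Because a read for a key always lands on the partition that received the writes for that key, the operator can answer it from local state, which is exactly the stream-table join the text describes.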

This correspondence between serving requests and performing joins is quite fundamental [47]. A one-off read request just passes the request through the join operator and then immediately forgets it; a subscribe request is a persistent join with past and future events on the other side of the join.

Recording a log of read events potentially also has benefits with regard to tracking causal dependencies and data provenance across a system: it would allow you to reconstruct what the user saw before they made a particular decision. For example, in an online shop, it is likely that the predicted shipping date and the inventory status shown to a customer affect whether they choose to buy an item [4]. To analyze this connection, you need to record the result of the user’s query of the shipping and inventory status.

Writing read events to durable storage thus enables better tracking of causal dependencies (see “Ordering events to capture causality”), but it incurs additional storage and I/O cost. Optimizing such systems to reduce the overhead is still an open research problem [2]. But if you already log read requests for operational purposes, as a side effect of request processing, it is not such a great change to make the log the source of the requests instead.

Multi-partition data processing

For queries that only touch a single partition, the effort of sending queries through a stream and collecting a stream of responses is perhaps overkill. However, this idea opens the possibility of distributed execution of complex queries that need to combine data from several partitions, taking advantage of the infrastructure for message routing, partitioning, and joining that is already provided by stream processors.

Storm’s distributed RPC feature supports this usage pattern (see “Message passing and RPC”). For example, it has been used to compute the number of people who have seen a URL on Twitter—i.e., the union of the follower sets of everyone who has tweeted that URL [48]. As the set of Twitter users is partitioned, this computation requires combining results from many partitions.

Another example of this pattern occurs in fraud prevention: in order to assess the risk of whether a particular purchase event is fraudulent, you can examine the reputation scores of the user’s IP address, email address, billing address, shipping address, and so on. Each of these reputation databases is itself partitioned, and so collecting the scores for a particular purchase event requires a sequence of joins with differently partitioned datasets [49].

The internal query execution graphs of MPP databases have similar characteristics (see “Comparing Hadoop to Distributed Databases”). If you need to perform this kind of multi-partition join, it is probably simpler to use a database that provides this feature than to implement it using a stream processor. However, treating queries as streams provides an option for implementing large-scale applications that run against the limits of conventional off-the-shelf solutions.

Aiming for Correctness

With stateless services that only read data, it is not a big deal if something goes wrong: you can fix the bug and restart the service, and everything returns to normal. Stateful systems such as databases are not so simple: they are designed to remember things forever (more or less), so if something goes wrong, the effects also potentially last forever—which means they require more careful thought [50].

We want to build applications that are reliable and correct (i.e., programs whose semantics are well defined and understood, even in the face of various faults). For approximately four decades, the transaction properties of atomicity, isolation, and durability (Chapter 7) have been the tools of choice for building correct applications. However, those foundations are weaker than they seem: witness for example the confusion of weak isolation levels (see “Weak Isolation Levels”).

In some areas, transactions are being abandoned entirely and replaced with models that offer better performance and scalability, but much messier semantics (see for example “Leaderless Replication”). Consistency is often talked about, but poorly defined (see “Consistency” and Chapter 9). Some people assert that we should “embrace weak consistency” for the sake of better availability, while lacking a clear idea of what that actually means in practice.

For a topic that is so important, our understanding and our engineering methods are surprisingly flaky. For example, it is very difficult to determine whether it is safe to run a particular application at a particular transaction isolation level or replication configuration [51, 52]. Often simple solutions appear to work correctly when concurrency is low and there are no faults, but turn out to have many subtle bugs in more demanding circumstances.

For example, Kyle Kingsbury’s Jepsen experiments [53] have highlighted the stark discrepancies between some products’ claimed safety guarantees and their actual behavior in the presence of network problems and crashes. Even if infrastructure products like databases were free from problems, application code would still need to correctly use the features they provide, which is error-prone if the configuration is hard to understand (which is the case with weak isolation levels, quorum configurations, and so on).

If your application can tolerate occasionally corrupting or losing data in unpredictable ways, life is a lot simpler, and you might be able to get away with simply crossing your fingers and hoping for the best. On the other hand, if you need stronger assurances of correctness, then serializability and atomic commit are established approaches, but they come at a cost: they typically only work in a single datacenter (ruling out geographically distributed architectures), and they limit the scale and fault-tolerance properties you can achieve.

While the traditional transaction approach is not going away, I also believe it is not the last word in making applications correct and resilient to faults. In this section I will suggest some ways of thinking about correctness in the context of dataflow architectures.

The End-to-End Argument for Databases

Just because an application uses a data system that provides comparatively strong safety properties, such as serializable transactions, that does not mean the application is guaranteed to be free from data loss or corruption. For example, if an application has a bug that causes it to write incorrect data, or delete data from a database, serializable transactions aren’t going to save you.

This example may seem frivolous, but it is worth taking seriously: application bugs occur, and people make mistakes. I used this example in “State, Streams, and Immutability” to argue in favor of immutable and append-only data, because it is easier to recover from such mistakes if you remove the ability of faulty code to destroy good data.

Although immutability is useful, it is not a cure-all by itself. Let’s look at a more subtle example of data corruption that can occur.

Exactly-once execution of an operation

In “Fault Tolerance” we encountered an idea called exactly-once (or effectively-once) semantics. If something goes wrong while processing a message, you can either give up (drop the message—i.e., incur data loss) or try again. If you try again, there is the risk that it actually succeeded the first time, but you just didn’t find out about the success, and so the message ends up being processed twice.

Processing twice is a form of data corruption: it is undesirable to charge a customer twice for the same service (billing them too much) or increment a counter twice (overstating some metric). In this context, exactly-once means arranging the computation such that the final effect is the same as if no faults had occurred, even if the operation actually was retried due to some fault. We previously discussed a few approaches for achieving this goal.

One of the most effective approaches is to make the operation idempotent (see “Idempotence”); that is, to ensure that it has the same effect, no matter whether it is executed once or multiple times. However, taking an operation that is not naturally idempotent and making it idempotent requires some effort and care: you may need to maintain some additional metadata (such as the set of operation IDs that have updated a value), and ensure fencing when failing over from one node to another (see “The leader and the lock”).
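
The operation-ID metadata mentioned above can be sketched as follows: a counter increment is not naturally idempotent, but becomes so if each operation carries a unique ID and IDs that were already applied are ignored (fencing on failover is omitted from this toy):

```python
counter = 0
applied_ops = set()  # the extra metadata: IDs of operations already applied

def increment(op_id):
    """Applying the same operation twice has the effect of applying it once."""
    global counter
    if op_id in applied_ops:
        return counter  # duplicate delivery after a retry: ignore it
    applied_ops.add(op_id)
    counter += 1
    return counter

increment("op-1")
increment("op-1")  # retried after an uncertain failure: no double count
increment("op-2")
```

The cost is that the set of applied IDs must itself be stored durably and consulted atomically with the update, which is where the effort and care come in.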

Duplicate suppression

The same pattern of needing to suppress duplicates occurs in many other places besides stream processing. For example, TCP uses sequence numbers on packets to put them in the correct order at the recipient, and to determine whether any packets were lost or duplicated on the network. Any lost packets are retransmitted and any duplicates are removed by the TCP stack before it hands the data to an application.

However, this duplicate suppression only works within the context of a single TCP connection. Imagine the TCP connection is a client’s connection to a database, and it is currently executing the transaction in Example 12-1. In many databases, a transaction is tied to a client connection (if the client sends several queries, the database knows that they belong to the same transaction because they are sent on the same TCP connection). If the client suffers a network interruption and connection timeout after sending the COMMIT, but before hearing back from the database server, it does not know whether the transaction has been committed or aborted (Figure 8-1).

例12-1。从一个账户到另一个账户的非幂等转账
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
COMMIT;

客户端可以重新连接数据库并重试事务,但现在这已经超出了 TCP 重复抑制的范围。由于示例 12-1 中的事务不是幂等的,因此可能会转账 22 美元,而不是预期的 11 美元。因此,尽管示例 12-1 是事务原子性的标准示例,但它实际上并不正确,真正的银行也不是这样运作的 [3]。

The client can reconnect to the database and retry the transaction, but now it is outside of the scope of TCP duplicate suppression. Since the transaction in Example 12-1 is not idempotent, it could happen that $22 is transferred instead of the desired $11. Thus, even though Example 12-1 is a standard example for transaction atomicity, it is actually not correct, and real banks do not work like this [3].

两阶段提交协议(请参阅“原子提交和两阶段提交 (2PC)”)打破了 TCP 连接和事务之间的 1:1 映射,因为它们必须允许事务协调器在网络故障后重新连接到数据库,并告诉它提交还是中止存疑事务。这足以确保事务只执行一次吗?不幸的是,不够。

Two-phase commit (see “Atomic Commit and Two-Phase Commit (2PC)”) protocols break the 1:1 mapping between a TCP connection and a transaction, since they must allow a transaction coordinator to reconnect to a database after a network fault, and tell it whether to commit or abort an in-doubt transaction. Is this sufficient to ensure that the transaction will only be executed once? Unfortunately not.

即使我们可以抑制数据库客户端和服务器之间的重复事务,我们仍然需要担心最终用户设备和应用程序服务器之间的网络。例如,如果最终用户客户端是 Web 浏览器,它很可能使用 HTTP POST 请求向服务器提交指令。也许用户的蜂窝数据信号较弱,他们成功发送了 POST,但在能够收到服务器的响应之前信号变得太弱。

Even if we can suppress duplicate transactions between the database client and server, we still need to worry about the network between the end-user device and the application server. For example, if the end-user client is a web browser, it probably uses an HTTP POST request to submit an instruction to the server. Perhaps the user is on a weak cellular data connection, and they succeed in sending the POST, but the signal becomes too weak before they are able to receive the response from the server.

在这种情况下,用户可能会看到一条错误消息,并且可能会手动重试。Web 浏览器会警告:“您确定要再次提交此表单吗?”,用户会说“是”,因为他们希望该操作发生。(Post/Redirect/Get 模式 [54] 在正常操作中可以避免此警告消息,但如果 POST 请求超时则无济于事。)从 Web 服务器的角度来看,重试是一个单独的请求;从数据库的角度来看,它是一个单独的事务。通常的重复数据删除机制在此无能为力。

In this case, the user will probably be shown an error message, and they may retry manually. Web browsers warn, “Are you sure you want to submit this form again?”—and the user says yes, because they wanted the operation to happen. (The Post/Redirect/Get pattern [54] avoids this warning message in normal operation, but it doesn’t help if the POST request times out.) From the web server’s point of view the retry is a separate request, and from the database’s point of view it is a separate transaction. The usual deduplication mechanisms don’t help.

操作标识符

Operation identifiers

为了使操作通过几跳网络通信实现幂等,仅依靠数据库提供的事务机制是不够的,您需要考虑 请求的端到端流程。

To make the operation idempotent through several hops of network communication, it is not sufficient to rely just on a transaction mechanism provided by a database—you need to consider the end-to-end flow of the request.

例如,您可以为操作生成一个唯一标识符(例如 UUID),并将其作为隐藏表单字段包含在客户端应用程序中,或者计算所有相关表单字段的哈希值来派生操作 ID [3]。如果 Web 浏览器提交了两次 POST 请求,这两次请求将具有相同的操作 ID。然后,您可以将该操作 ID 一直传递到数据库,并检查对于给定的 ID 只执行一次操作,如示例 12-2 所示。

For example, you could generate a unique identifier for an operation (such as a UUID) and include it as a hidden form field in the client application, or calculate a hash of all the relevant form fields to derive the operation ID [3]. If the web browser submits the POST request twice, the two requests will have the same operation ID. You can then pass that operation ID all the way through to the database and check that you only ever execute one operation with a given ID, as shown in Example 12-2.

例12-2。使用唯一 ID 抑制重复请求
ALTER TABLE requests ADD UNIQUE (request_id);

BEGIN TRANSACTION;

INSERT INTO requests
  (request_id, from_account, to_account, amount)
  VALUES('0286FDB8-D7E1-423F-B40B-792B3608036C', 4321, 1234, 11.00);

UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;

COMMIT;

示例 12-2 依赖于 request_id 列上的唯一性约束。如果事务尝试插入一个已存在的 ID,INSERT 将失败,事务将被中止,从而防止它生效两次。关系数据库通常可以正确地维护唯一性约束,即使在弱隔离级别下也是如此(而应用程序级别的先检查后插入在不可序列化隔离下可能会失败,如“写入倾斜和幻影”中所讨论的)。

Example 12-2 relies on a uniqueness constraint on the request_id column. If a transaction attempts to insert an ID that already exists, the INSERT fails and the transaction is aborted, preventing it from taking effect twice. Relational databases can generally maintain a uniqueness constraint correctly, even at weak isolation levels (whereas an application-level check-then-insert may fail under nonserializable isolation, as discussed in “Write Skew and Phantoms”).
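The mechanism of Example 12-2 can be demonstrated end to end with SQLite (a sketch under illustrative names; the shortened request ID and `transfer` helper are not from the book). The UNIQUE constraint makes the retried transaction abort instead of applying twice:

```python
# Sketch of Example 12-2's idea in SQLite: a UNIQUE constraint on
# request_id causes a retried transfer to abort rather than apply twice.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE requests (request_id TEXT UNIQUE, from_account INTEGER,
                           to_account INTEGER, amount REAL);
    INSERT INTO accounts VALUES (1234, 100.00), (4321, 100.00);
""")

def transfer(request_id, from_acct, to_acct, amount):
    try:
        with db:  # one transaction; rolled back if any statement fails
            db.execute("INSERT INTO requests VALUES (?, ?, ?, ?)",
                       (request_id, from_acct, to_acct, amount))
            db.execute("UPDATE accounts SET balance = balance + ? "
                       "WHERE account_id = ?", (amount, to_acct))
            db.execute("UPDATE accounts SET balance = balance - ? "
                       "WHERE account_id = ?", (amount, from_acct))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate request ID: whole transaction aborted

assert transfer("0286FDB8", 4321, 1234, 11.00) is True
assert transfer("0286FDB8", 4321, 1234, 11.00) is False  # retry suppressed
balance = db.execute(
    "SELECT balance FROM accounts WHERE account_id = 1234").fetchone()[0]
assert balance == 111.00  # credited exactly once
```

Because the INSERT and both UPDATEs are in one transaction, the duplicate retry leaves the balances untouched, not half-applied.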

除了抑制重复请求之外,示例 12-2 中的 requests 表还充当一种事件日志,暗示了事件溯源的方向(请参阅“事件溯源”)。帐户余额的更新实际上不必与事件的插入发生在同一事务中,因为它们是冗余的,可以由下游消费者从请求事件中派生出来——只要该事件只被处理一次,而这又可以使用请求 ID 来强制执行。

Besides suppressing duplicate requests, the requests table in Example 12-2 acts as a kind of event log, hinting in the direction of event sourcing (see “Event Sourcing”). The updates to the account balances don’t actually have to happen in the same transaction as the insertion of the event, since they are redundant and could be derived from the request event in a downstream consumer—as long as the event is processed exactly once, which can again be enforced using the request ID.

端到端的论证

The end-to-end argument

这种抑制重复事务的场景只是一个更普遍原则的例子,该原则称为端到端论证,由 Saltzer、Reed 和 Clark 于 1984 年提出 [55]:

This scenario of suppressing duplicate transactions is just one example of a more general principle called the end-to-end argument, which was articulated by Saltzer, Reed, and Clark in 1984 [55]:

只有借助位于通信系统端点处的应用程序的知识和帮助,所讨论的功能才能完整而正确地实现。因此,将所讨论的功能作为通信系统本身的特性来提供是不可能的。(有时,通信系统提供的该功能的不完整版本可能有助于提升性能。)

The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)

在我们的示例中,所讨论的功能是重复抑制。我们看到,TCP 在 TCP 连接级别抑制了重复数据包,一些流处理器在消息处理级别提供了所谓的“恰好一次”语义,但如果第一个请求超时,这些都不足以防止用户提交重复请求。TCP、数据库事务和流处理器本身并不能完全排除这些重复。解决这个问题需要一个端到端的解决方案:一个从最终用户客户端一直传递到数据库的事务标识符。

In our example, the function in question was duplicate suppression. We saw that TCP suppresses duplicate packets at the TCP connection level, and some stream processors provide so-called exactly-once semantics at the message processing level, but that is not enough to prevent a user from submitting a duplicate request if the first one times out. By themselves, TCP, database transactions, and stream processors cannot entirely rule out these duplicates. Solving the problem requires an end-to-end solution: a transaction identifier that is passed all the way from the end-user client to the database.

端到端论证也适用于检查数据的完整性:以太网、TCP 和 TLS 中内置的校验和可以检测网络中数据包的损坏,但它们无法检测由网络连接两端收发软件中的错误导致的损坏,也无法检测存储数据的磁盘上的损坏。如果您想捕获所有可能的数据损坏来源,还需要端到端的校验和。

The end-to-end argument also applies to checking the integrity of data: checksums built into Ethernet, TCP, and TLS can detect corruption of packets in the network, but they cannot detect corruption due to bugs in the software at the sending and receiving ends of the network connection, or corruption on the disks where the data is stored. If you want to catch all possible sources of data corruption, you also need end-to-end checksums.
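The idea of an end-to-end checksum can be sketched as follows (illustrative names; the record format is invented for the example): the original writer computes a checksum over the payload, and the final reader verifies it, catching corruption that per-hop checks such as TCP or TLS never see, e.g. a bit flipped while the data sat on a disk.

```python
# Sketch: an end-to-end checksum computed by the writer and verified
# by the final reader, spanning all intermediate hops and storage.
import hashlib

def write_record(payload):
    # The writer attaches a checksum that travels with the data.
    return payload, hashlib.sha256(payload).hexdigest()

def read_record(payload, checksum):
    # The final reader verifies the checksum end to end.
    if hashlib.sha256(payload).hexdigest() != checksum:
        raise ValueError("end-to-end checksum mismatch: data corrupted")
    return payload

data, chk = write_record(b"account=1234,amount=11.00")
assert read_record(data, chk) == data

corrupted = b"account=1234,amount=99.00"  # e.g. mangled by a buggy disk
try:
    read_record(corrupted, chk)
    assert False, "corruption should have been detected"
except ValueError:
    pass
```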

类似的论证也适用于加密 [55]:您家庭 WiFi 网络上的密码可以防止别人窥探您的 WiFi 流量,但不能防范互联网上其他地方的攻击者;客户端和服务器之间的 TLS/SSL 可以防范网络攻击者,但不能防范服务器本身被攻破。只有端到端的加密和身份验证才能防范所有这些问题。

A similar argument applies with encryption [55]: the password on your home WiFi network protects against people snooping your WiFi traffic, but not against attackers elsewhere on the internet; TLS/SSL between your client and the server protects against network attackers, but not against compromises of the server. Only end-to-end encryption and authentication can protect against all of these things.

尽管低级功能(TCP 重复抑制、以太网校验和、WiFi 加密)本身无法提供所需的端到端功能,但它们仍然有用,因为它们降低了较高级别出现问题的概率。例如,如果没有 TCP 将数据包恢复到正确的顺序,HTTP 请求通常会被弄乱。我们只需要记住,低级可靠性功能本身并不足以确保端到端的正确性。

Although the low-level features (TCP duplicate suppression, Ethernet checksums, WiFi encryption) cannot provide the desired end-to-end features by themselves, they are still useful, since they reduce the probability of problems at the higher levels. For example, HTTP requests would often get mangled if we didn’t have TCP putting the packets back in the right order. We just need to remember that the low-level reliability features are not by themselves sufficient to ensure end-to-end correctness.

在数据系统中应用端到端思维

Applying end-to-end thinking in data systems

这让我回到了最初的论点:仅仅因为应用程序使用提供相对较强的安全属性(例如可序列化事务)的数据系统,并不意味着应用程序可以保证不会丢失或损坏数据。应用程序本身也需要采取端到端的措施,例如重复抑制。

This brings me back to my original thesis: just because an application uses a data system that provides comparatively strong safety properties, such as serializable transactions, that does not mean the application is guaranteed to be free from data loss or corruption. The application itself needs to take end-to-end measures, such as duplicate suppression, as well.

这很遗憾,因为容错机制很难正确实现。低级可靠性机制(例如 TCP 中的机制)工作得相当好,因此剩下的高层故障相对少见。如果能将剩余的高层容错机制封装到一个抽象中,让应用程序代码无需为此操心,那就太好了——但我担心我们还没有找到正确的抽象。

That is a shame, because fault-tolerance mechanisms are hard to get right. Low-level reliability mechanisms, such as those in TCP, work quite well, and so the remaining higher-level faults occur fairly rarely. It would be really nice to wrap up the remaining high-level fault-tolerance machinery in an abstraction so that application code needn’t worry about it—but I fear that we have not yet found the right abstraction.

事务长期以来一直被视为一种良好的抽象,我也确实相信它们是有用的。正如第 7 章引言中所讨论的,它们将各种可能的问题(并发写入、约束违反、崩溃、网络中断、磁盘故障)归结为两种可能的结果:提交或中止。这极大地简化了编程模型,但我担心这还不够。

Transactions have long been seen as a good abstraction, and I do believe that they are useful. As discussed in the introduction to Chapter 7, they take a wide range of possible issues (concurrent writes, constraint violations, crashes, network interruptions, disk failures) and collapse them down to two possible outcomes: commit or abort. That is a huge simplification of the programming model, but I fear that it is not enough.

事务是昂贵的,特别是当它们涉及异构存储技术时(参见 “实践中的分布式事务”)。当我们因为成本太高而拒绝使用分布式事务时,我们最终不得不在应用程序代码中重新实现容错机制。正如本书中的大量示例所示,关于并发和部分失败的推理是困难且违反直觉的,因此我怀疑大多数应用程序级机制无法正常工作。结果是数据丢失或损坏。

Transactions are expensive, especially when they involve heterogeneous storage technologies (see “Distributed Transactions in Practice”). When we refuse to use distributed transactions because they are too expensive, we end up having to reimplement fault-tolerance mechanisms in application code. As numerous examples throughout this book have shown, reasoning about concurrency and partial failure is difficult and counterintuitive, and so I suspect that most application-level mechanisms do not work correctly. The consequence is lost or corrupted data.

出于这些原因,我认为值得探索这样的容错抽象:它可以轻松提供特定于应用程序的端到端正确性属性,同时在大规模分布式环境中保持良好的性能和运维特性。

For these reasons, I think it is worth exploring fault-tolerance abstractions that make it easy to provide application-specific end-to-end correctness properties, but also maintain good performance and good operational characteristics in a large-scale distributed environment.

强制约束

Enforcing Constraints

让我们在分拆数据库(“分拆数据库”)这一思想的背景下思考正确性。我们看到,端到端的重复抑制可以通过一个从客户端一直传递到记录写入的数据库的请求 ID 来实现。那么其他类型的约束呢?

Let’s think about correctness in the context of the ideas around unbundling databases (“Unbundling Databases”). We saw that end-to-end duplicate suppression can be achieved with a request ID that is passed all the way from the client to the database that records the write. What about other kinds of constraints?

特别是,让我们关注唯一性约束——例如我们在 示例 12-2中依赖的约束。在“约束和唯一性保证”中,我们看到了需要强制唯一性的应用程序功能的其他几个示例:用户名或电子邮件地址必须唯一地标识用户,文件存储服务不能有多个同名文件,以及两个人无法在航班或剧院预订相同的座位。

In particular, let’s focus on uniqueness constraints—such as the one we relied on in Example 12-2. In “Constraints and uniqueness guarantees” we saw several other examples of application features that need to enforce uniqueness: a username or email address must uniquely identify a user, a file storage service cannot have more than one file with the same name, and two people cannot book the same seat on a flight or in a theater.

其他类型的约束非常相似:例如,确保帐户余额永远不会变为负值,确保您销售的商品不超过仓库库存,或者会议室没有重叠的预订。强制唯一性的技术通常也可用于此类约束。

Other kinds of constraints are very similar: for example, ensuring that an account balance never goes negative, that you don’t sell more items than you have in stock in the warehouse, or that a meeting room does not have overlapping bookings. Techniques that enforce uniqueness can often be used for these kinds of constraints as well.

唯一性约束需要达成共识

Uniqueness constraints require consensus

在第 9 章中,我们看到,在分布式环境中,强制执行唯一性约束需要达成共识:如果存在多个具有相同值的并发请求,系统需要以某种方式决定接受哪一个冲突操作,并将其他操作作为违反约束而拒绝。

In Chapter 9 we saw that in a distributed setting, enforcing a uniqueness constraint requires consensus: if there are several concurrent requests with the same value, the system somehow needs to decide which one of the conflicting operations is accepted, and reject the others as violations of the constraint.

实现这种共识的最常见方法是让单个节点成为领导者,并让它负责做出所有决策。只要您不介意通过单个节点汇集所有请求(即使客户端位于世界的另一端),并且只要该节点不发生故障,这种方法就可以正常工作。如果您需要容忍领导者失败,那么您将再次回到共识问题(请参阅“单领导者复制和共识”)。

The most common way of achieving this consensus is to make a single node the leader, and put it in charge of making all the decisions. That works fine as long as you don’t mind funneling all requests through a single node (even if the client is on the other side of the world), and as long as that node doesn’t fail. If you need to tolerate the leader failing, you’re back at the consensus problem again (see “Single-leader replication and consensus”).

可以通过基于需要唯一的值进行分区来扩展唯一性检查。例如,如果您需要通过请求 ID 来确保唯一性,如示例 12-2所示,您可以确保具有相同请求 ID 的所有请求都路由到同一分区(请参阅 第 6 章)。如果您需要用户名是唯一的,您可以按用户名的哈希进行分区。

Uniqueness checking can be scaled out by partitioning based on the value that needs to be unique. For example, if you need to ensure uniqueness by request ID, as in Example 12-2, you can ensure all requests with the same request ID are routed to the same partition (see Chapter 6). If you need usernames to be unique, you can partition by hash of username.
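The routing described above can be sketched with a stable hash function (a sketch; the partition count and function names are illustrative). The key property is that the same key always maps to the same partition, so one sequential consumer per partition can enforce uniqueness for its keys:

```python
# Sketch: routing requests so that all requests for the same key
# (request ID or username) land on the same partition.
import hashlib

NUM_PARTITIONS = 8

def partition_for(key):
    # Use a stable hash, not Python's built-in hash(), which is
    # salted per process and would route inconsistently across nodes.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# The same username always routes to the same partition, so a single
# consumer of that partition sees every claim for it, in order.
assert partition_for("alice") == partition_for("alice")
assert 0 <= partition_for("bob") < NUM_PARTITIONS
```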

但是,异步多主复制被排除在外,因为可能发生不同的主节点同时接受冲突写入的情况,从而导致值不再唯一(请参阅“实现线性化系统”)。如果您希望能够立即拒绝任何违反约束的写入,同步协调是不可避免的 [56]。

However, asynchronous multi-master replication is ruled out, because it could happen that different masters concurrently accept conflicting writes, and thus the values are no longer unique (see “Implementing Linearizable Systems”). If you want to be able to immediately reject any writes that would violate the constraint, synchronous coordination is unavoidable [56].

基于日志的消息传递中的唯一性

Uniqueness in log-based messaging

该日志确保所有消费者以相同的顺序看到消息——这种保证正式称为全序广播,相当于共识(请参阅“全序广播”)。在具有基于日志消息传递的非捆绑数据库方法中,我们可以使用非常相似的方法来强制唯一性约束。

The log ensures that all consumers see messages in the same order—a guarantee that is formally known as total order broadcast and is equivalent to consensus (see “Total Order Broadcast”). In the unbundled database approach with log-based messaging, we can use a very similar approach to enforce uniqueness constraints.

流处理器在单个线程上按顺序消费日志分区中的所有消息(请参阅“日志与传统消息传递的比较”)。因此,如果日志是根据需要唯一的值来分区的,流处理器就可以明确且确定性地决定多个冲突操作中哪一个先到。例如,在多个用户尝试抢注相同用户名的情况下 [57]:

A stream processor consumes all the messages in a log partition sequentially on a single thread (see “Logs compared to traditional messaging”). Thus, if the log is partitioned based on the value that needs to be unique, a stream processor can unambiguously and deterministically decide which one of several conflicting operations came first. For example, in the case of several users trying to claim the same username [57]:

  1. 对用户名的每个请求都被编码为一条消息,并追加到由用户名的哈希确定的分区中。

  1. Every request for a username is encoded as a message, and appended to a partition determined by the hash of the username.

  2. 流处理器顺序读取日志中的请求,并使用本地数据库来跟踪哪些用户名已被占用。对于每个请求可用用户名的请求,它将该名称记录为已占用,并向输出流发出一条成功消息。对于每个请求已被占用用户名的请求,它向输出流发出一条拒绝消息。

  2. A stream processor sequentially reads the requests in the log, using a local database to keep track of which usernames are taken. For every request for a username that is available, it records the name as taken and emits a success message to an output stream. For every request for a username that is already taken, it emits a rejection message to an output stream.

  3. 请求用户名的客户端监视输出流,并等待与其请求相对应的成功或拒绝消息。

  3. The client that requested the username watches the output stream and waits for a success or rejection message corresponding to its request.
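The username-claim algorithm above can be sketched as a single-threaded processor over one partition's log (a sketch; the in-memory dict stands in for the local database, and all names are illustrative):

```python
# Sketch: a single-threaded stream processor reads claims from one log
# partition in order and decides deterministically which claim wins.

def process_partition(log):
    taken = {}      # local state: username -> winning request ID
    output = []     # output stream of (request_id, verdict)
    for request_id, username in log:  # sequential, single-threaded
        if username not in taken:
            taken[username] = request_id
            output.append((request_id, "success"))
        else:
            output.append((request_id, "rejected"))
    return output

log = [("req-1", "alice"), ("req-2", "alice"), ("req-3", "bob")]
assert process_partition(log) == [
    ("req-1", "success"), ("req-2", "rejected"), ("req-3", "success")]
```

Because the log fixes the order and the logic is deterministic, every replay of the partition reaches the same verdicts, which is what makes the decision unambiguous.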

该算法与“使用全序广播实现线性化存储”中的算法基本相同。通过增加分区数量,它可以轻松扩展以支持很大的请求吞吐量,因为每个分区都可以独立处理。

This algorithm is basically the same as in “Implementing linearizable storage using total order broadcast”. It scales easily to a large request throughput by increasing the number of partitions, as each partition can be processed independently.

该方法不仅适用于唯一性约束,也适用于许多其他类型的约束。其基本原理是,任何可能发生冲突的写入都被路由到同一分区并按顺序处理。正如“什么是冲突?”和“写入倾斜和幻影”中所讨论的,冲突的定义可能取决于应用程序,但流处理器可以使用任意逻辑来验证请求。这个想法与 Bayou 在 20 世纪 90 年代开创的方法类似 [58]。

The approach works not only for uniqueness constraints, but also for many other kinds of constraints. Its fundamental principle is that any writes that may conflict are routed to the same partition and processed sequentially. As discussed in “What is a conflict?” and “Write Skew and Phantoms”, the definition of a conflict may depend on the application, but the stream processor can use arbitrary logic to validate a request. This idea is similar to the approach pioneered by Bayou in the 1990s [58].

多分区请求处理

Multi-partition request processing

当涉及多个分区时,在满足约束的同时确保操作以原子方式执行,就变得更有趣了。在示例 12-2 中,可能存在三个分区:一个包含请求 ID,一个包含收款人账户,一个包含付款人账户。这三样东西没有理由必须在同一个分区中,因为它们相互独立。

Ensuring that an operation is executed atomically, while satisfying constraints, becomes more interesting when several partitions are involved. In Example 12-2, there are potentially three partitions: the one containing the request ID, the one containing the payee account, and the one containing the payer account. There is no reason why those three things should be in the same partition, since they are all independent from each other.

在传统的数据库方法中,执行此事务需要跨所有三个分区进行原子提交,这实际上迫使它相对于任何这些分区上的所有其他事务处于全序状态。由于现在存在跨分区协调,不同的分区无法再独立处理,因此吞吐量可能会受到影响。

In the traditional approach to databases, executing this transaction would require an atomic commit across all three partitions, which essentially forces it into a total order with respect to all other transactions on any of those partitions. Since there is now cross-partition coordination, different partitions can no longer be processed independently, so throughput is likely to suffer.

然而,事实证明,使用分区日志可以实现等效的正确性,并且无需原子提交:

However, it turns out that equivalent correctness can be achieved with partitioned logs, and without an atomic commit:

  1. 从账户 A 向账户 B 转账的请求由客户端赋予一个唯一的请求 ID,并根据该请求 ID 追加到一个日志分区。

  1. The request to transfer money from account A to account B is given a unique request ID by the client, and appended to a log partition based on the request ID.

  2. 流处理器读取请求日志。对于每条请求消息,它向输出流发出两条消息:一条发给付款人账户 A 的借记指令(按 A 分区),以及一条发给收款人账户 B 的贷记指令(按 B 分区)。原始请求 ID 包含在这些发出的消息中。

  2. A stream processor reads the log of requests. For each request message it emits two messages to output streams: a debit instruction to the payer account A (partitioned by A), and a credit instruction to the payee account B (partitioned by B). The original request ID is included in those emitted messages.

  3. 后续处理器消费贷记和借记指令流,按请求 ID 去重,并将变更应用于账户余额。

  3. Further processors consume the streams of credit and debit instructions, deduplicate by request ID, and apply the changes to the account balances.

步骤 1 和 2 是必要的,因为如果客户端直接发送贷方和借方指令,则需要跨这两个分区进行原子提交,以确保两者都发生或都不发生。为了避免分布式事务的需要,我们首先将请求持久地记录为单个消息,然后从第一条消息中派生出贷方和借方指令。单对象写入在几乎所有数据系统中都是原子的(请参阅“单对象写入”),因此请求要么出现在日志中,要么不出现在日志中,不需要多分区原子提交。

Steps 1 and 2 are necessary because if the client directly sent the credit and debit instructions, it would require an atomic commit across those two partitions to ensure that either both or neither happen. To avoid the need for a distributed transaction, we first durably log the request as a single message, and then derive the credit and debit instructions from that first message. Single-object writes are atomic in almost all data systems (see “Single-object writes”), and so the request either appears in the log or it doesn’t, without any need for a multi-partition atomic commit.

如果流处理器在步骤 2 中崩溃,它将从最后一个检查点恢复处理。这样做时,它不会跳过任何请求消息,但它可能会多次处理请求并产生重复的贷记和借记指令。然而,由于它是确定性的,它只会再次产生相同的指令,并且步骤 3 中的处理器可以使用端到端请求 ID 轻松地删除它们的重复数据。

If the stream processor in step 2 crashes, it resumes processing from its last checkpoint. In doing so, it does not skip any request messages, but it may process requests multiple times and produce duplicate credit and debit instructions. However, since it is deterministic, it will just produce the same instructions again, and the processors in step 3 can easily deduplicate them using the end-to-end request ID.
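The two-stage dataflow and its crash-recovery behavior can be sketched as follows (illustrative names throughout; lists stand in for partitioned logs). Stage 1 is deterministic, so a restart after a crash re-emits the same instructions, and stage 2's deduplication by request ID absorbs them:

```python
# Sketch: two-stage transfer processing with end-to-end request IDs.

def stage1(requests):
    # Deterministic fan-out: reprocessing a request after a crash
    # emits exactly the same debit/credit instructions again.
    instructions = []
    for req_id, payer, payee, amount in requests:
        instructions.append((req_id, payer, -amount))  # debit payer
        instructions.append((req_id, payee, +amount))  # credit payee
    return instructions

def stage2(instructions, balances):
    applied = set()  # (request_id, account) pairs already applied
    for req_id, account, delta in instructions:
        if (req_id, account) in applied:
            continue  # duplicate from a stage-1 restart: suppressed
        applied.add((req_id, account))
        balances[account] += delta
    return balances

requests = [("req-1", "A", "B", 11.00)]
# Simulate a crash: stage 1 resumes from its checkpoint and re-emits.
instructions = stage1(requests) + stage1(requests)
balances = stage2(instructions, {"A": 100.00, "B": 100.00})
assert balances == {"A": 89.00, "B": 111.00}  # applied exactly once
```

No atomic commit spans the two account partitions; the request ID carried through both stages is what makes the end result effectively-once.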

如果您想确保付款人帐户不会因此次转账而透支,您可以另外拥有一个流处理器(按付款人帐号分区)来维护帐户余额并验证交易。只有有效的交易才会被放入步骤 1 中的请求日志中。

If you want to ensure that the payer account is not overdrawn by this transfer, you can additionally have a stream processor (partitioned by payer account number) that maintains account balances and validates transactions. Only valid transactions would then be placed in the request log in step 1.

通过将多分区事务分解为两个采用不同分区方式的阶段,并使用端到端请求 ID,我们实现了同样的正确性属性(每个请求对付款人和收款人账户都只生效一次),即使在出现故障的情况下也是如此,并且无需使用原子提交协议。使用多个不同分区阶段的想法类似于我们在“多分区数据处理”中讨论的内容(另请参阅“并发控制”)。

By breaking down the multi-partition transaction into two differently partitioned stages and using the end-to-end request ID, we have achieved the same correctness property (every request is applied exactly once to both the payer and payee accounts), even in the presence of faults, and without using an atomic commit protocol. The idea of using multiple differently partitioned stages is similar to what we discussed in “Multi-partition data processing” (see also “Concurrency control”).

及时性和完整性

Timeliness and Integrity

事务的一个便利属性是它们通常是可线性化的(请参阅 “线性化”):也就是说,写入者会等待事务提交,此后其写入对所有读取者立即可见。

A convenient property of transactions is that they are typically linearizable (see “Linearizability”): that is, a writer waits until a transaction is committed, and thereafter its writes are immediately visible to all readers.

当跨流处理器的多个阶段分拆操作时,情况并非如此:日志的使用者在设计上是异步的,因此发送者不会等到其消息被使用者处理完毕。但是,客户端可以等待消息出现在输出流上。这就是我们在“基于日志的消息传递中的唯一性”中检查是否满足唯一性约束时所做的事情。

This is not the case when unbundling an operation across multiple stages of stream processors: consumers of a log are asynchronous by design, so a sender does not wait until its message has been processed by consumers. However, it is possible for a client to wait for a message to appear on an output stream. This is what we did in “Uniqueness in log-based messaging” when checking whether a uniqueness constraint was satisfied.

在此示例中,唯一性检查的正确性不取决于消息的发送者是否等待结果。等待的目的只是同步通知发送方唯一性检查是否成功,但这种通知可以与处理消息的效果解耦。

In this example, the correctness of the uniqueness check does not depend on whether the sender of the message waits for the outcome. The waiting only has the purpose of synchronously informing the sender whether or not the uniqueness check succeeded, but this notification can be decoupled from the effects of processing the message.

更一般地说,我认为一致性这个术语把两个值得分开考虑的不同要求混为一谈了:

More generally, I think the term consistency conflates two different requirements that are worth considering separately:

及时性
Timeliness

及时性意味着确保用户观察到系统处于最新状态。我们之前看到,如果用户从过时的数据副本中读取,他们可能会观察到系统处于不一致的状态(请参阅“复制滞后问题”)。然而,这种不一致是暂时的,最终只需等待并重试即可解决。

CAP 定理(请参阅“线性化的成本”)使用的是线性化意义上的一致性,这是实现及时性的一种强方式。较弱的及时性属性,例如写后读一致性(请参阅“读取您自己的写入”)也可能有用。

Timeliness means ensuring that users observe the system in an up-to-date state. We saw previously that if a user reads from a stale copy of the data, they may observe it in an inconsistent state (see “Problems with Replication Lag”). However, that inconsistency is temporary, and will eventually be resolved simply by waiting and trying again.

The CAP theorem (see “The Cost of Linearizability”) uses consistency in the sense of linearizability, which is a strong way of achieving timeliness. Weaker timeliness properties like read-after-write consistency (see “Reading Your Own Writes”) can also be useful.

完整性
Integrity

完整性意味着没有损坏;即,没有数据丢失,也没有矛盾或错误的数据。特别是,如果某个派生数据集被维护为某些底层数据的视图(请参阅“从事件日志派生当前状态”),那么派生过程必须正确。例如,数据库索引必须正确反映数据库的内容——缺少某些记录的索引没有多大用处。

如果完整性被破坏,这种不一致是永久性的:在大多数情况下,等待并重试并不能修复数据库损坏。相反,需要显式的检查和修复。在 ACID 事务的上下文中(请参阅“ACID 的含义”),一致性通常被理解为某种特定于应用程序的完整性概念。原子性和持久性是保持完整性的重要工具。

Integrity means absence of corruption; i.e., no data loss, and no contradictory or false data. In particular, if some derived dataset is maintained as a view onto some underlying data (see “Deriving current state from the event log”), the derivation must be correct. For example, a database index must correctly reflect the contents of the database—an index in which some records are missing is not very useful.

If integrity is violated, the inconsistency is permanent: waiting and trying again is not going to fix database corruption in most cases. Instead, explicit checking and repair is needed. In the context of ACID transactions (see “The Meaning of ACID”), consistency is usually understood as some kind of application-specific notion of integrity. Atomicity and durability are important tools for preserving integrity.

用口号来说:违反及时性是“最终一致性”,而违反完整性是“永久不一致”。

In slogan form: violations of timeliness are “eventual consistency,” whereas violations of integrity are “perpetual inconsistency.”

我断言,在大多数应用程序中,完整性比及时性重要得多。违反及时性可能会令人烦恼和困惑,但违反完整性可能会带来灾难性的后果。

I am going to assert that in most applications, integrity is much more important than timeliness. Violations of timeliness can be annoying and confusing, but violations of integrity can be catastrophic.

例如,在您的信用卡对账单上,如果您在过去 24 小时内进行的某笔交易尚未出现,这并不奇怪——这些系统存在一定的滞后是正常的。我们知道,银行对交易的对账和结算是异步进行的,及时性在这里并不是很重要 [3]。但是,如果账单余额不等于之前的账单余额加上各笔交易之和(求和出错),或者某笔交易向您收了款却没有付给商户(钱凭空消失),那就非常糟糕了。这类问题就是对系统完整性的破坏。

For example, on your credit card statement, it is not surprising if a transaction that you made within the last 24 hours does not yet appear—it is normal that these systems have a certain lag. We know that banks reconcile and settle transactions asynchronously, and timeliness is not very important here [3]. However, it would be very bad if the statement balance was not equal to the sum of the transactions plus the previous statement balance (an error in the sums), or if a transaction was charged to you but not paid to the merchant (disappearing money). Such problems would be violations of the integrity of the system.

数据流系统的正确性

Correctness of dataflow systems

ACID 事务通常提供及时性(例如,线性化)和完整性(例如,原子提交)保证。因此,如果您从 ACID 事务的角度来看待应用程序的正确性,那么及时性和完整性之间的区别就相当无关紧要了。

ACID transactions usually provide both timeliness (e.g., linearizability) and integrity (e.g., atomic commit) guarantees. Thus, if you approach application correctness from the point of view of ACID transactions, the distinction between timeliness and integrity is fairly inconsequential.

另一方面,我们在本章中讨论的基于事件的数据流系统有一个有趣的属性:它们将及时性和完整性解耦了。异步处理事件流时,无法保证及时性,除非您显式构建在返回之前等待消息到达的消费者。但完整性实际上是流处理系统的核心。

On the other hand, an interesting property of the event-based dataflow systems that we have discussed in this chapter is that they decouple timeliness and integrity. When processing event streams asynchronously, there is no guarantee of timeliness, unless you explicitly build consumers that wait for a message to arrive before returning. But integrity is in fact central to streaming systems.

恰好一次或有效一次语义(参见“容错”)是一种保持完整性的机制。如果一个事件丢失,或者一个事件生效了两次,数据系统的完整性就可能被破坏。因此,容错的消息传递和重复抑制(例如,幂等操作)对于在出现故障时维护数据系统的完整性非常重要。

Exactly-once or effectively-once semantics (see “Fault Tolerance”) is a mechanism for preserving integrity. If an event is lost, or if an event takes effect twice, the integrity of a data system could be violated. Thus, fault-tolerant message delivery and duplicate suppression (e.g., idempotent operations) are important for maintaining the integrity of a data system in the face of faults.

正如我们在上一节中看到的,可靠的流处理系统可以在不需要分布式事务和原子提交协议的情况下保持完整性,这意味着它们可以以更好的性能和操作稳健性实现相当的正确性。我们通过多种机制的组合实现了这种完整性:

As we saw in the last section, reliable stream processing systems can preserve integrity without requiring distributed transactions and an atomic commit protocol, which means they can potentially achieve comparable correctness with much better performance and operational robustness. We achieved this integrity through a combination of mechanisms:

  • 将写入操作的内容表示为单个消息,可以轻松地以原子方式写入——这种方法非常适合事件溯源(请参阅“事件溯源”)

  • Representing the content of the write operation as a single message, which can easily be written atomically—an approach that fits very well with event sourcing (see “Event Sourcing”)

  • 使用确定性派生函数从该单个消息派生所有其他状态更新,类似于存储过程(请参阅“实际串行执行”和“作为派生函数的应用程序代码”)

  • Deriving all other state updates from that single message using deterministic derivation functions, similarly to stored procedures (see “Actual Serial Execution” and “Application code as a derivation function”)

  • 通过所有这些级别的处理传递客户端生成的请求 ID,从而实现端到端重复抑制和幂等性

  • Passing a client-generated request ID through all these levels of processing, enabling end-to-end duplicate suppression and idempotence

  • 使消息不可变,并允许不时重新处理派生数据,这使得从错误中恢复更容易(请参阅“不可变事件的优点”)

  • Making messages immutable and allowing derived data to be reprocessed from time to time, which makes it easier to recover from bugs (see “Advantages of immutable events”)

在我看来,这种机制的组合是未来构建容错应用程序的一个非常有前途的方向。

This combination of mechanisms seems to me a very promising direction for building fault-tolerant applications in the future.

松散解释的约束

Loosely interpreted constraints

如前所述,强制执行唯一性约束需要达成共识,这通常通过将特定分区中的所有事件汇集到单个节点来实现。如果我们想要传统形式的唯一性约束,这种限制是不可避免的,流处理也无法避开它。

As discussed previously, enforcing a uniqueness constraint requires consensus, typically implemented by funneling all events in a particular partition through a single node. This limitation is unavoidable if we want the traditional form of uniqueness constraint, and stream processing cannot avoid it.

然而,另一件需要认识到的事情是,许多实际应用程序其实可以采用弱得多的唯一性概念:

However, another thing to realize is that many real applications can actually get away with much weaker notions of uniqueness:

  • 如果两个人同时注册了相同的用户名或预订了相同的座位,您可以向其中一个人发送消息表示歉意,并请他们另选一个。这种用来纠正错误的变更称为补偿性事务 [59, 60]。

  • If two people concurrently register the same username or book the same seat, you can send one of them a message to apologize, and ask them to choose a different one. This kind of change to correct a mistake is called a compensating transaction [59, 60].

  • 如果客户订购的商品多于您仓库中的库存,您可以补充订购更多库存,为延迟向客户道歉,并为他们提供折扣。这实际上与以下情况中您必须做的事情相同:比如,一辆叉车碾压了您仓库中的一些物品,导致库存比您以为的要少 [61]。因此,道歉工作流无论如何都已经需要成为您业务流程的一部分,因此可能没有必要对库存商品数量施加可线性化的约束。

  • If customers order more items than you have in your warehouse, you can order in more stock, apologize to customers for the delay, and offer them a discount. This is actually the same as what you’d have to do if, say, a forklift truck ran over some of the items in your warehouse, leaving you with fewer items in stock than you thought you had [61]. Thus, the apology workflow already needs to be part of your business processes anyway, and so it might be unnecessary to require a linearizable constraint on the number of items in stock.

  • 同样,许多航空公司会超售机票,预计一些乘客会错过航班;许多酒店会超订房间,预计一些客人会取消预订。在这些情况下,出于商业原因,“一人一座”的约束被故意违反,并且设置了补偿流程(退款、升级、在邻近酒店提供免费房间)来处理需求超过供给的情况。即使没有超售,也需要道歉和补偿流程来应对因恶劣天气或员工罢工而取消的航班——从此类问题中恢复只是业务的正常部分 [3]。

  • Similarly, many airlines overbook airplanes in the expectation that some passengers will miss their flight, and many hotels overbook rooms, expecting that some guests will cancel. In these cases, the constraint of “one person per seat” is deliberately violated for business reasons, and compensation processes (refunds, upgrades, providing a complimentary room at a neighboring hotel) are put in place to handle situations in which demand exceeds supply. Even if there was no overbooking, apology and compensation processes would be needed in order to deal with flights being cancelled due to bad weather or staff on strike—recovering from such issues is just a normal part of business [3].

  • If someone withdraws more money than they have in their account, the bank can charge them an overdraft fee and ask them to pay back what they owe. By limiting the total withdrawals per day, the risk to the bank is bounded.

In many business contexts, it is actually acceptable to temporarily violate a constraint and fix it up later by apologizing. The cost of the apology (in terms of money or reputation) varies, but it is often quite low: you can’t unsend an email, but you can send a follow-up email with a correction. If you accidentally charge a credit card twice, you can refund one of the charges, and the cost to you is just the processing fees and perhaps a customer complaint. Once money has been paid out of an ATM, you can’t directly get it back, although in principle you can send debt collectors to recover the money if the account was overdrawn and the customer won’t pay it back.

Whether the cost of the apology is acceptable is a business decision. If it is acceptable, the traditional model of checking all constraints before even writing the data is unnecessarily restrictive, and a linearizable constraint is not needed. It may well be a reasonable choice to go ahead with a write optimistically, and to check the constraint after the fact. You can still ensure that the validation occurs before doing things that would be expensive to recover from, but that doesn’t imply you must do the validation before you even write the data.
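
As a toy sketch of this optimistic approach (the inventory example and all names are invented for illustration): orders are accepted without any up-front linearizable check, and a separate validation step later detects oversell and triggers the apology workflow, i.e. a compensating transaction:

```python
from dataclasses import dataclass, field

@dataclass
class Inventory:
    stock: int
    orders: list = field(default_factory=list)
    apologies: list = field(default_factory=list)

    def place_order(self, order_id: str, qty: int) -> None:
        # Optimistic path: no lock and no constraint check before the write.
        self.orders.append((order_id, qty))

    def check_constraint(self) -> None:
        # After-the-fact validation: if we oversold, apologize to the most
        # recent orders (e.g. a delay notice plus a discount) instead of
        # having blocked the writes up front.
        sold = sum(qty for _, qty in self.orders)
        remaining_overrun = sold - self.stock
        for order_id, qty in reversed(self.orders):
            if remaining_overrun <= 0:
                break
            self.apologies.append(order_id)
            remaining_overrun -= qty

inv = Inventory(stock=10)
inv.place_order("a", 6)
inv.place_order("b", 6)   # together these exceed the 10 items in stock
inv.check_constraint()    # order "b" ends up in the apology workflow
```

The key property is that integrity is preserved (every order and every apology is recorded), while the timeliness of the constraint check is relaxed.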

These applications do require integrity: you would not want to lose a reservation, or have money disappear due to mismatched credits and debits. But they don’t require timeliness on the enforcement of the constraint: if you have sold more items than you have in the warehouse, you can patch up the problem after the fact by apologizing. Doing so is similar to the conflict resolution approaches we discussed in “Handling Write Conflicts”.

Coordination-avoiding data systems

We have now made two interesting observations:

  1. Dataflow systems can maintain integrity guarantees on derived data without atomic commit, linearizability, or synchronous cross-partition coordination.

  2. Although strict uniqueness constraints require timeliness and coordination, many applications are actually fine with loose constraints that may be temporarily violated and fixed up later, as long as integrity is preserved throughout.

Taken together, these observations mean that dataflow systems can provide the data management services for many applications without requiring coordination, while still giving strong integrity guarantees. Such coordination-avoiding data systems have a lot of appeal: they can achieve better performance and fault tolerance than systems that need to perform synchronous coordination [56].

For example, such a system could operate distributed across multiple datacenters in a multi-leader configuration, asynchronously replicating between regions. Any one datacenter can continue operating independently from the others, because no synchronous cross-region coordination is required. Such a system would have weak timeliness guarantees—it could not be linearizable without introducing coordination—but it can still have strong integrity guarantees.

In this context, serializable transactions are still useful as part of maintaining derived state, but they can be run at a small scope where they work well [8]. Heterogeneous distributed transactions such as XA transactions (see “Distributed Transactions in Practice”) are not required. Synchronous coordination can still be introduced in places where it is needed (for example, to enforce strict constraints before an operation from which recovery is not possible), but there is no need for everything to pay the cost of coordination if only a small part of an application needs it [43].

Another way of looking at coordination and constraints: they reduce the number of apologies you have to make due to inconsistencies, but potentially also reduce the performance and availability of your system, and thus potentially increase the number of apologies you have to make due to outages. You cannot reduce the number of apologies to zero, but you can aim to find the best trade-off for your needs—the sweet spot where there are neither too many inconsistencies nor too many availability problems.

Trust, but Verify

All of our discussion of correctness, integrity, and fault-tolerance has been under the assumption that certain things might go wrong, but other things won’t. We call these assumptions our system model (see “Mapping system models to the real world”): for example, we should assume that processes can crash, machines can suddenly lose power, and the network can arbitrarily delay or drop messages. But we might also assume that data written to disk is not lost after fsync, that data in memory is not corrupted, and that the multiplication instruction of our CPU always returns the correct result.

These assumptions are quite reasonable, as they are true most of the time, and it would be difficult to get anything done if we had to constantly worry about our computers making mistakes. Traditionally, system models take a binary approach toward faults: we assume that some things can happen, and other things can never happen. In reality, it is more a question of probabilities: some things are more likely, other things less likely. The question is whether violations of our assumptions happen often enough that we may encounter them in practice.

We have seen that data can become corrupted while it is sitting untouched on disks (see “Replication and Durability”), and data corruption on the network can sometimes evade the TCP checksums (see “Weak forms of lying”). Maybe this is something we should be paying more attention to?

One application that I worked on in the past collected crash reports from clients, and some of the reports we received could only be explained by random bit-flips in the memory of those devices. It seems unlikely, but if you have enough devices running your software, even very unlikely things do happen. Besides random memory corruption due to hardware faults or radiation, certain pathological memory access patterns can flip bits even in memory that has no faults [62]—an effect that can be used to break security mechanisms in operating systems [63] (this technique is known as rowhammer). Once you look closely, hardware isn’t quite the perfect abstraction that it may seem.

To be clear, random bit-flips are still very rare on modern hardware [64]. I just want to point out that they are not beyond the realm of possibility, and so they deserve some attention.

Maintaining integrity in the face of software bugs

Besides such hardware issues, there is always the risk of software bugs, which would not be caught by lower-level network, memory, or filesystem checksums. Even widely used database software has bugs: I have personally seen cases of MySQL failing to correctly maintain a uniqueness constraint [65] and PostgreSQL’s serializable isolation level exhibiting write skew anomalies [66], even though MySQL and PostgreSQL are robust and well-regarded databases that have been battle-tested by many people for many years. In less mature software, the situation is likely to be much worse.

Despite considerable efforts in careful design, testing, and review, bugs still creep in. Although they are rare, and they eventually get found and fixed, there is still a period during which such bugs can corrupt data.

When it comes to application code, we have to assume many more bugs, since most applications don’t receive anywhere near the amount of review and testing that database code does. Many applications don’t even correctly use the features that databases offer for preserving integrity, such as foreign key or uniqueness constraints [36].

Consistency in the sense of ACID (see “Consistency”) is based on the idea that the database starts off in a consistent state, and a transaction transforms it from one consistent state to another consistent state. Thus, we expect the database to always be in a consistent state. However, this notion only makes sense if you assume that the transaction is free from bugs. If the application uses the database incorrectly in some way, for example using a weak isolation level unsafely, the integrity of the database cannot be guaranteed.

Don’t just blindly trust what they promise

With both hardware and software not always living up to the ideal that we would like them to be, it seems that data corruption is inevitable sooner or later. Thus, we should at least have a way of finding out if data has been corrupted so that we can fix it and try to track down the source of the error. Checking the integrity of data is known as auditing.

As discussed in “Advantages of immutable events”, auditing is not just for financial applications. However, auditability is highly important in finance precisely because everyone knows that mistakes happen, and we all recognize the need to be able to detect and fix problems.

Mature systems similarly tend to consider the possibility of unlikely things going wrong, and manage that risk. For example, large-scale storage systems such as HDFS and Amazon S3 do not fully trust disks: they run background processes that continually read back files, compare them to other replicas, and move files from one disk to another, in order to mitigate the risk of silent corruption [67].
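
A background scrubbing pass in this spirit might look like the following sketch (the `scrub` function and the checksum map are hypothetical; real systems such as HDFS compare block checksums across replicas rather than consulting a simple dict):

```python
import hashlib
import os

def scrub(directory: str, checksums: dict) -> list:
    """Read every file back and compare it to a previously recorded
    SHA-256 checksum, reporting any silent corruption.
    `checksums` maps filename -> expected hex digest."""
    corrupted = []
    for name, expected in checksums.items():
        path = os.path.join(directory, name)
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in chunks so large files don't have to fit in memory.
            for chunk in iter(lambda: f.read(8192), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected:
            corrupted.append(name)
    return corrupted
```

A production scrubber would additionally throttle its I/O and repair corrupted files from another replica rather than merely reporting them.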

If you want to be sure that your data is still there, you have to actually read it and check. Most of the time it will still be there, but if it isn’t, you really want to find out sooner rather than later. By the same argument, it is important to try restoring from your backups from time to time—otherwise you may only find out that your backup is broken when it is too late and you have already lost data. Don’t just blindly trust that it is all working.

A culture of verification

Systems like HDFS and S3 still have to assume that disks work correctly most of the time—which is a reasonable assumption, but not the same as assuming that they always work correctly. However, not many systems currently have this kind of “trust, but verify” approach of continually auditing themselves. Many assume that correctness guarantees are absolute and make no provision for the possibility of rare data corruption. I hope that in the future we will see more self-validating or self-auditing systems that continually check their own integrity, rather than relying on blind trust [68].

I fear that the culture of ACID databases has led us toward developing applications on the basis of blindly trusting technology (such as a transaction mechanism), and neglecting any sort of auditability in the process. Since the technology we trusted worked well enough most of the time, auditing mechanisms were not deemed worth the investment.

But then the database landscape changed: weaker consistency guarantees became the norm under the banner of NoSQL, and less mature storage technologies became widely used. Yet, because the audit mechanisms had not been developed, we continued building applications on the basis of blind trust, even though this approach had now become more dangerous. Let’s think for a moment about designing for auditability.

Designing for auditability

If a transaction mutates several objects in a database, it is difficult to tell after the fact what that transaction means. Even if you capture the transaction logs (see “Change Data Capture”), the insertions, updates, and deletions in various tables do not necessarily give a clear picture of why those mutations were performed. The invocation of the application logic that decided on those mutations is transient and cannot be reproduced.

By contrast, event-based systems can provide better auditability. In the event sourcing approach, user input to the system is represented as a single immutable event, and any resulting state updates are derived from that event. The derivation can be made deterministic and repeatable, so that running the same log of events through the same version of the derivation code will result in the same state updates.

Being explicit about dataflow (see “Philosophy of batch process outputs”) makes the provenance of data much clearer, which makes integrity checking much more feasible. For the event log, we can use hashes to check that the event storage has not been corrupted. For any derived state, we can rerun the batch and stream processors that derived it from the event log in order to check whether we get the same result, or even run a redundant derivation in parallel.
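
Both ideas can be sketched together: a hash-chained event log in which every entry commits to its predecessor, plus a deterministic derivation that can simply be re-run as an integrity check. The event shape and the `derive_state` logic here are invented for illustration:

```python
import hashlib
import json

def append_event(log: list, event: dict) -> None:
    # Each entry records a hash over its predecessor's hash and its own
    # payload, so tampering with any stored event breaks the chain.
    prev = log[-1]["hash"] if log else ""
    payload = json.dumps(event, sort_keys=True)
    entry_hash = hashlib.sha256((prev + payload).encode()).hexdigest()
    log.append({"event": event, "hash": entry_hash})

def verify_chain(log: list) -> bool:
    prev = ""
    for entry in log:
        payload = json.dumps(entry["event"], sort_keys=True)
        if hashlib.sha256((prev + payload).encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

def derive_state(log: list) -> int:
    # Deterministic derivation: re-running it over the same log must yield
    # the same result, so a rerun doubles as a check on derived state.
    return sum(entry["event"]["amount"] for entry in log)
```

Comparing `derive_state(log)` against the stored derived value, after `verify_chain(log)` has confirmed the log itself, checks the pipeline from event storage through to derived state.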

A deterministic and well-defined dataflow also makes it easier to debug and trace the execution of a system in order to determine why it did something [4, 69]. If something unexpected occurred, it is valuable to have the diagnostic capability to reproduce the exact circumstances that led to the unexpected event—a kind of time-travel debugging capability.

The end-to-end argument again

If we cannot fully trust that every individual component of the system will be free from corruption—that every piece of hardware is fault-free and that every piece of software is bug-free—then we must at least periodically check the integrity of our data. If we don’t check, we won’t find out about corruption until it is too late and it has caused some downstream damage, at which point it will be much harder and more expensive to track down the problem.

Checking the integrity of data systems is best done in an end-to-end fashion (see “The End-to-End Argument for Databases”): the more systems we can include in an integrity check, the fewer opportunities there are for corruption to go unnoticed at some stage of the process. If we can check that an entire derived data pipeline is correct end to end, then any disks, networks, services, and algorithms along the path are implicitly included in the check.

Having continuous end-to-end integrity checks gives you increased confidence about the correctness of your systems, which in turn allows you to move faster [70]. Like automated testing, auditing increases the chances that bugs will be found quickly, and thus reduces the risk that a change to the system or a new storage technology will cause damage. If you are not afraid of making changes, you can much better evolve an application to meet changing requirements.

Tools for auditable data systems

At present, not many data systems make auditability a top-level concern. Some applications implement their own audit mechanisms, for example by logging all changes to a separate audit table, but guaranteeing the integrity of the audit log and the database state is still difficult. A transaction log can be made tamper-proof by periodically signing it with a hardware security module, but that does not guarantee that the right transactions went into the log in the first place.

It would be interesting to use cryptographic tools to prove the integrity of a system in a way that is robust to a wide range of hardware and software issues, and even potentially malicious actions. Cryptocurrencies, blockchains, and distributed ledger technologies such as Bitcoin, Ethereum, Ripple, Stellar, and various others [71, 72, 73] have sprung up to explore this area.

I am not qualified to comment on the merits of these technologies as currencies or mechanisms for agreeing contracts. However, from a data systems point of view they contain some interesting ideas. Essentially, they are distributed databases, with a data model and transaction mechanism, in which different replicas can be hosted by mutually untrusting organizations. The replicas continually check each other’s integrity and use a consensus protocol to agree on the transactions that should be executed.

I am somewhat skeptical about the Byzantine fault tolerance aspects of these technologies (see “Byzantine Faults”), and I find the technique of proof of work (e.g., Bitcoin mining) extraordinarily wasteful. The transaction throughput of Bitcoin is rather low, albeit for political and economic reasons more than for technical ones. However, the integrity checking aspects are interesting.

Cryptographic auditing and integrity checking often relies on Merkle trees [74], which are trees of hashes that can be used to efficiently prove that a record appears in some dataset (and a few other things). Outside of the hype of cryptocurrencies, certificate transparency is a security technology that relies on Merkle trees to check the validity of TLS/SSL certificates [75, 76].
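
A minimal Merkle tree with membership proofs can be sketched as follows. This toy version uses the common convention of duplicating the last node on odd-sized levels; production schemes such as the one in certificate transparency (RFC 6962) hash leaves and interior nodes slightly differently:

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list) -> bytes:
    """Root hash of a Merkle tree over the given leaf values."""
    level = [sha(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # pad odd levels
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves: list, index: int) -> list:
    """Sibling hashes proving that leaves[index] is in the tree.
    Each step records (sibling_hash, our_node_is_on_the_left)."""
    level = [sha(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[index ^ 1], index % 2 == 0))
        level = [sha(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify_proof(leaf: bytes, proof: list, root: bytes) -> bool:
    acc = sha(leaf)
    for sibling, we_are_left in proof:
        acc = sha(acc + sibling) if we_are_left else sha(sibling + acc)
    return acc == root
```

The appeal for auditing is that a proof has size logarithmic in the number of records, so a client can verify that one record is included in a huge dataset without downloading the dataset itself.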

I could imagine integrity-checking and auditing algorithms, like those of certificate transparency and distributed ledgers, becoming more widely used in data systems in general. Some work will be needed to make them equally scalable as systems without cryptographic auditing, and to keep the performance penalty as low as possible. But I think this is an interesting area to watch in the future.

Doing the Right Thing

In the final section of this book, I would like to take a step back. Throughout this book we have examined a wide range of different architectures for data systems, evaluated their pros and cons, and explored techniques for building reliable, scalable, and maintainable applications. However, we have left out an important and fundamental part of the discussion, which I would now like to fill in.

Every system is built for a purpose; every action we take has both intended and unintended consequences. The purpose may be as simple as making money, but the consequences for the world may reach far beyond that original purpose. We, the engineers building these systems, have a responsibility to carefully consider those consequences and to consciously decide what kind of world we want to live in.

We talk about data as an abstract thing, but remember that many datasets are about people: their behavior, their interests, their identity. We must treat such data with humanity and respect. Users are humans too, and human dignity is paramount.

Software development increasingly involves making important ethical choices. There are guidelines to help software engineers navigate these issues, such as the ACM’s Software Engineering Code of Ethics and Professional Practice [77], but they are rarely discussed, applied, and enforced in practice. As a result, engineers and product managers sometimes take a very cavalier attitude to privacy and potential negative consequences of their products [78, 79, 80].

A technology is not good or bad in itself—what matters is how it is used and how it affects people. This is true for a software system like a search engine in much the same way as it is for a weapon like a gun. I think it is not sufficient for software engineers to focus exclusively on the technology and ignore its consequences: the ethical responsibility is ours to bear also. Reasoning about ethics is difficult, but it is too important to ignore.

Predictive Analytics

For example, predictive analytics is a major part of the “Big Data” hype. Using data analysis to predict the weather, or the spread of diseases, is one thing [81]; it is another matter to predict whether a convict is likely to reoffend, whether an applicant for a loan is likely to default, or whether an insurance customer is likely to make expensive claims. The latter have a direct effect on individual people’s lives.

Naturally, payment networks want to prevent fraudulent transactions, banks want to avoid bad loans, airlines want to avoid hijackings, and companies want to avoid hiring ineffective or untrustworthy people. From their point of view, the cost of a missed business opportunity is low, but the cost of a bad loan or a problematic employee is much higher, so it is natural for organizations to want to be cautious. If in doubt, they are better off saying no.

However, as algorithmic decision-making becomes more widespread, someone who has (accurately or falsely) been labeled as risky by some algorithm may suffer a large number of those “no” decisions. Systematically being excluded from jobs, air travel, insurance coverage, property rental, financial services, and other key aspects of society is such a large constraint of the individual’s freedom that it has been called “algorithmic prison” [82]. In countries that respect human rights, the criminal justice system presumes innocence until proven guilty; on the other hand, automated systems can systematically and arbitrarily exclude a person from participating in society without any proof of guilt, and with little chance of appeal.

Bias and discrimination

Decisions made by an algorithm are not necessarily any better or any worse than those made by a human. Every person is likely to have biases, even if they actively try to counteract them, and discriminatory practices can become culturally institutionalized. There is hope that basing decisions on data, rather than subjective and instinctive assessments by people, could be more fair and give a better chance to people who are often overlooked in the traditional system [83].

When we develop predictive analytics systems, we are not merely automating a human’s decision by using software to specify the rules for when to say yes or no; we are even leaving the rules themselves to be inferred from data. However, the patterns learned by these systems are opaque: even if there is some correlation in the data, we may not know why. If there is a systematic bias in the input to an algorithm, the system will most likely learn and amplify that bias in its output [84].

In many countries, anti-discrimination laws prohibit treating people differently depending on protected traits such as ethnicity, age, gender, sexuality, disability, or beliefs. Other features of a person’s data may be analyzed, but what happens if they are correlated with protected traits? For example, in racially segregated neighborhoods, a person’s postal code or even their IP address is a strong predictor of race. Put like this, it seems ridiculous to believe that an algorithm could somehow take biased data as input and produce fair and impartial output from it [85]. Yet this belief often seems to be implied by proponents of data-driven decision making, an attitude that has been satirized as “machine learning is like money laundering for bias” [86].

Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they codify that discrimination. If we want the future to be better than the past, moral imagination is required, and that’s something only humans can provide [87]. Data and models should be our tools, not our masters.

Responsibility and accountability

Automated decision making opens the question of responsibility and accountability [87]. If a human makes a mistake, they can be held accountable, and the person affected by the decision can appeal. Algorithms make mistakes too, but who is accountable if they go wrong [88]? When a self-driving car causes an accident, who is responsible? If an automated credit scoring algorithm systematically discriminates against people of a particular race or religion, is there any recourse? If a decision by your machine learning system comes under judicial review, can you explain to the judge how the algorithm made its decision?

Credit rating agencies are an old example of collecting data to make decisions about people. A bad credit score makes life difficult, but at least a credit score is normally based on relevant facts about a person’s actual borrowing history, and any errors in the record can be corrected (although the agencies normally do not make this easy). However, scoring algorithms based on machine learning typically use a much wider range of inputs and are much more opaque, making it harder to understand how a particular decision has come about and whether someone is being treated in an unfair or discriminatory way [89].

A credit score summarizes “How did you behave in the past?” whereas predictive analytics usually work on the basis of “Who is similar to you, and how did people like you behave in the past?” Drawing parallels to others’ behavior implies stereotyping people, for example based on where they live (a close proxy for race and socioeconomic class). What about people who get put in the wrong bucket? Furthermore, if a decision is incorrect due to erroneous data, recourse is almost impossible [87].

Much data is statistical in nature, which means that even if the probability distribution on the whole is correct, individual cases may well be wrong. For example, if the average life expectancy in your country is 80 years, that doesn’t mean you’re expected to drop dead on your 80th birthday. From the average and the probability distribution, you can’t say much about the age to which one particular person will live. Similarly, the output of a prediction system is probabilistic and may well be wrong in individual cases.

A blind belief in the supremacy of data for making decisions is not only delusional, it is positively dangerous. As data-driven decision making becomes more widespread, we will need to figure out how to make algorithms accountable and transparent, how to avoid reinforcing existing biases, and how to fix them when they inevitably make mistakes.

We will also need to figure out how to prevent data being used to harm people, and realize its positive potential instead. For example, analytics can reveal financial and social characteristics of people’s lives. On the one hand, this power could be used to focus aid and support to help those people who most need it. On the other hand, it is sometimes used by predatory businesses seeking to identify vulnerable people and sell them risky products such as high-cost loans and worthless college degrees [87, 90].

Feedback loops

Even with predictive applications that have less immediately far-reaching effects on people, such as recommendation systems, there are difficult issues that we must confront. When services become good at predicting what content users want to see, they may end up showing people only opinions they already agree with, leading to echo chambers in which stereotypes, misinformation, and polarization can breed. We are already seeing the impact of social media echo chambers on election campaigns [91].

When predictive analytics affect people’s lives, particularly pernicious problems arise due to self-reinforcing feedback loops. For example, consider the case of employers using credit scores to evaluate potential hires. You may be a good worker with a good credit score, but suddenly find yourself in financial difficulties due to a misfortune outside of your control. As you miss payments on your bills, your credit score suffers, and you will be less likely to find work. Joblessness pushes you toward poverty, which further worsens your scores, making it even harder to find employment [87]. It’s a downward spiral due to poisonous assumptions, hidden behind a camouflage of mathematical rigor and data.

We can’t always predict when such feedback loops happen. However, many consequences can be predicted by thinking about the entire system (not just the computerized parts, but also the people interacting with it)—an approach known as systems thinking [92]. We can try to understand how a data analysis system responds to different behaviors, structures, or characteristics. Does the system reinforce and amplify existing differences between people (e.g., making the rich richer or the poor poorer), or does it try to combat injustice? And even with the best intentions, we must beware of unintended consequences.

Privacy and Tracking

Besides the problems of predictive analytics—i.e., using data to make automated decisions about people—there are ethical problems with data collection itself. What is the relationship between the organizations collecting data and the people whose data is being collected?

When a system only stores data that a user has explicitly entered, because they want the system to store and process it in a certain way, the system is performing a service for the user: the user is the customer. But when a user’s activity is tracked and logged as a side effect of other things they are doing, the relationship is less clear. The service no longer just does what the user tells it to do, but it takes on interests of its own, which may conflict with the user’s interests.

Tracking behavioral data has become increasingly important for user-facing features of many online services: tracking which search results are clicked helps improve the ranking of search results; recommending “people who liked X also liked Y” helps users discover interesting and useful things; A/B tests and user flow analysis can help indicate how a user interface might be improved. Those features require some amount of tracking of user behavior, and users benefit from them.

However, depending on a company’s business model, tracking often doesn’t stop there. If the service is funded through advertising, the advertisers are the actual customers, and the users’ interests take second place. Tracking data becomes more detailed, analyses become further-reaching, and data is retained for a long time in order to build up detailed profiles of each person for marketing purposes.

Now the relationship between the company and the user whose data is being collected starts looking quite different. The user is given a free service and is coaxed into engaging with it as much as possible. The tracking of the user serves not primarily that individual, but rather the needs of the advertisers who are funding the service. I think this relationship can be appropriately described with a word that has more sinister connotations: surveillance.

Surveillance

As a thought experiment, try replacing the word data with surveillance, and observe if common phrases still sound so good [93]. How about this: “In our surveillance-driven organization we collect real-time surveillance streams and store them in our surveillance warehouse. Our surveillance scientists use advanced analytics and surveillance processing in order to derive new insights.”

This thought experiment is unusually polemic for this book, Designing Surveillance-Intensive Applications, but I think that strong words are needed to emphasize this point. In our attempts to make software “eat the world” [94], we have built the greatest mass surveillance infrastructure the world has ever seen. Rushing toward an Internet of Things, we are rapidly approaching a world in which every inhabited space contains at least one internet-connected microphone, in the form of smartphones, smart TVs, voice-controlled assistant devices, baby monitors, and even children’s toys that use cloud-based speech recognition. Many of these devices have a terrible security record [95].

Even the most totalitarian and repressive regimes could only dream of putting a microphone in every room and forcing every person to constantly carry a device capable of tracking their location and movements. Yet we apparently voluntarily, even enthusiastically, throw ourselves into this world of total surveillance. The difference is just that the data is being collected by corporations rather than government agencies [96].

Not all data collection necessarily qualifies as surveillance, but examining it as such can help us understand our relationship with the data collector. Why are we seemingly happy to accept surveillance by corporations? Perhaps you feel you have nothing to hide—in other words, you are totally in line with existing power structures, you are not a marginalized minority, and you needn’t fear persecution [97]. Not everyone is so fortunate. Or perhaps it’s because the purpose seems benign—it’s not overt coercion and conformance, but merely better recommendations and more personalized marketing. However, combined with the discussion of predictive analytics from the last section, that distinction seems less clear.

We are already seeing car insurance premiums linked to tracking devices in cars, and health insurance coverage that depends on people wearing a fitness tracking device. When surveillance is used to determine things that hold sway over important aspects of life, such as insurance coverage or employment, it starts to appear less benign. Moreover, data analysis can reveal surprisingly intrusive things: for example, the movement sensor in a smartwatch or fitness tracker can be used to work out what you are typing (for example, passwords) with fairly good accuracy [98]. And algorithms for analysis are only going to get better.

Consent and freedom of choice

We might assert that users voluntarily choose to use a service that tracks their activity, and they have agreed to the terms of service and privacy policy, so they consent to data collection. We might even claim that users are receiving a valuable service in return for the data they provide, and that the tracking is necessary in order to provide the service. Undoubtedly, social networks, search engines, and various other free online services are valuable to users—but there are problems with this argument.

Users have little knowledge of what data they are feeding into our databases, or how it is retained and processed—and most privacy policies do more to obscure than to illuminate. Without understanding what happens to their data, users cannot give any meaningful consent. Often, data from one user also says things about other people who are not users of the service and who have not agreed to any terms. The derived datasets that we discussed in this part of the book—in which data from the entire user base may have been combined with behavioral tracking and external data sources—are precisely the kinds of data of which users cannot have any meaningful understanding.

Moreover, data is extracted from users through a one-way process, not a relationship with true reciprocity, and not a fair value exchange. There is no dialog, no option for users to negotiate how much data they provide and what service they receive in return: the relationship between the service and the user is very asymmetric and one-sided. The terms are set by the service, not by the user [99].

For a user who does not consent to surveillance, the only real alternative is simply not to use a service. But this choice is not free either: if a service is so popular that it is “regarded by most people as essential for basic social participation” [99], then it is not reasonable to expect people to opt out of this service—using it is de facto mandatory. For example, in most Western social communities, it has become the norm to carry a smartphone, to use Facebook for socializing, and to use Google for finding information. Especially when a service has network effects, there is a social cost to people choosing not to use it.

Declining to use a service due to its tracking of users is only an option for the small number of people who are privileged enough to have the time and knowledge to understand its privacy policy, and who can afford to potentially miss out on social participation or professional opportunities that may have arisen if they had participated in the service. For people in a less privileged position, there is no meaningful freedom of choice: surveillance becomes inescapable.

Privacy and use of data

Sometimes people claim that “privacy is dead” on the grounds that some users are willing to post all sorts of things about their lives to social media, sometimes mundane and sometimes deeply personal. However, this claim is false and rests on a misunderstanding of the word privacy.

Having privacy does not mean keeping everything secret; it means having the freedom to choose which things to reveal to whom, what to make public, and what to keep secret. The right to privacy is a decision right: it enables each person to decide where they want to be on the spectrum between secrecy and transparency in each situation [99]. It is an important aspect of a person’s freedom and autonomy.

When data is extracted from people through surveillance infrastructure, privacy rights are not necessarily eroded, but rather transferred to the data collector. Companies that acquire data essentially say “trust us to do the right thing with your data,” which means that the right to decide what to reveal and what to keep secret is transferred from the individual to the company.

The companies in turn choose to keep much of the outcome of this surveillance secret, because to reveal it would be perceived as creepy, and would harm their business model (which relies on knowing more about people than other companies do). Intimate information about users is only revealed indirectly, for example in the form of tools for targeting advertisements to specific groups of people (such as those suffering from a particular illness).

Even if particular users cannot be personally reidentified from the bucket of people targeted by a particular ad, they have lost their agency about the disclosure of some intimate information, such as whether they suffer from some illness. It is not the user who decides what is revealed to whom on the basis of their personal preferences—it is the company that exercises the privacy right with the goal of maximizing its profit.

Many companies have a goal of not being perceived as creepy—avoiding the question of how intrusive their data collection actually is, and instead focusing on managing user perceptions. And even these perceptions are often managed poorly: for example, something may be factually correct, but if it triggers painful memories, the user may not want to be reminded about it [100]. With any kind of data we should expect the possibility that it is wrong, undesirable, or inappropriate in some way, and we need to build mechanisms for handling those failures. Whether something is “undesirable” or “inappropriate” is of course down to human judgment; algorithms are oblivious to such notions unless we explicitly program them to respect human needs. As engineers of these systems we must be humble, accepting and planning for such failings.

Privacy settings that allow a user of an online service to control which aspects of their data other users can see are a starting point for handing back some control to users. However, regardless of the setting, the service itself still has unfettered access to the data, and is free to use it in any way permitted by the privacy policy. Even if the service promises not to sell the data to third parties, it usually grants itself unrestricted rights to process and analyze the data internally, often going much further than what is overtly visible to users.

This kind of large-scale transfer of privacy rights from individuals to corporations is historically unprecedented [99]. Surveillance has always existed, but it used to be expensive and manual, not scalable and automated. Trust relationships have always existed, for example between a patient and their doctor, or between a defendant and their attorney—but in these cases the use of data has been strictly governed by ethical, legal, and regulatory constraints. Internet services have made it much easier to amass huge amounts of sensitive information without meaningful consent, and to use it at massive scale without users understanding what is happening to their private data.

Data as assets and power

Since behavioral data is a byproduct of users interacting with a service, it is sometimes called “data exhaust”—suggesting that the data is worthless waste material. Viewed this way, behavioral and predictive analytics can be seen as a form of recycling that extracts value from data that would have otherwise been thrown away.

More correct would be to view it the other way round: from an economic point of view, if targeted advertising is what pays for a service, then behavioral data about people is the service’s core asset. In this case, the application with which the user interacts is merely a means to lure users into feeding more and more personal information into the surveillance infrastructure [99]. The delightful human creativity and social relationships that often find expression in online services are cynically exploited by the data extraction machine.

The assertion that personal data is a valuable asset is supported by the existence of data brokers, a shady industry operating in secrecy, purchasing, aggregating, analyzing, inferring, and reselling intrusive personal data about people, mostly for marketing purposes [90]. Startups are valued by their user numbers, by “eyeballs”—i.e., by their surveillance capabilities.

Because the data is valuable, many people want it. Of course companies want it—that’s why they collect it in the first place. But governments want to obtain it too: by means of secret deals, coercion, legal compulsion, or simply stealing it [101]. When a company goes bankrupt, the personal data it has collected is one of the assets that get sold. Moreover, the data is difficult to secure, so breaches happen disconcertingly often [102].

These observations have led critics to saying that data is not just an asset, but a “toxic asset” [101], or at least “hazardous material” [103]. Even if we think that we are capable of preventing abuse of data, whenever we collect data, we need to balance the benefits with the risk of it falling into the wrong hands: computer systems may be compromised by criminals or hostile foreign intelligence services, data may be leaked by insiders, the company may fall into the hands of unscrupulous management that does not share our values, or the country may be taken over by a regime that has no qualms about compelling us to hand over the data.

When collecting data, we need to consider not just today’s political environment, but all possible future governments. There is no guarantee that every government elected in future will respect human rights and civil liberties, so “it is poor civic hygiene to install technologies that could someday facilitate a police state” [104].

“Knowledge is power,” as the old adage goes. And furthermore, “to scrutinize others while avoiding scrutiny oneself is one of the most important forms of power” [105]. This is why totalitarian governments want surveillance: it gives them the power to control the population. Although today’s technology companies are not overtly seeking political power, the data and knowledge they have accumulated nevertheless gives them a lot of power over our lives, much of which is surreptitious, outside of public oversight [106].

Remembering the Industrial Revolution

Data is the defining feature of the information age. The internet, data storage, processing, and software-driven automation are having a major impact on the global economy and human society. As our daily lives and social organization have changed in the past decade, and will probably continue to radically change in the coming decades, comparisons to the Industrial Revolution come to mind [87, 96].

The Industrial Revolution came about through major technological and agricultural advances, and it brought sustained economic growth and significantly improved living standards in the long run. Yet it also came with major problems: pollution of the air (due to smoke and chemical processes) and the water (from industrial and human waste) was dreadful. Factory owners lived in splendor, while urban workers often lived in very poor housing and worked long hours in harsh conditions. Child labor was common, including dangerous and poorly paid work in mines.

It took a long time before safeguards were established, such as environmental protection regulations, safety protocols for workplaces, outlawing child labor, and health inspections for food. Undoubtedly the cost of doing business increased when factories could no longer dump their waste into rivers, sell tainted foods, or exploit workers. But society as a whole benefited hugely, and few of us would want to return to a time before those regulations [87].

Just as the Industrial Revolution had a dark side that needed to be managed, our transition to the information age has major problems that we need to confront and solve. I believe that the collection and use of data is one of those problems. In the words of Bruce Schneier [96]:

Data is the pollution problem of the information age, and protecting privacy is the environmental challenge. Almost all computers produce information. It stays around, festering. How we deal with it—how we contain it and how we dispose of it—is central to the health of our information economy. Just as we look back today at the early decades of the industrial age and wonder how our ancestors could have ignored pollution in their rush to build an industrial world, our grandchildren will look back at us during these early decades of the information age and judge us on how we addressed the challenge of data collection and misuse.

We should try to make them proud.

Legislation and self-regulation

Data protection laws might be able to help preserve individuals’ rights. For example, the 1995 European Data Protection Directive states that personal data must be “collected for specified, explicit and legitimate purposes and not further processed in a way incompatible with those purposes,” and furthermore that data must be “adequate, relevant and not excessive in relation to the purposes for which they are collected” [107].

However, it is doubtful whether this legislation is effective in today’s internet context [108]. These rules run directly counter to the philosophy of Big Data, which is to maximize data collection, to combine it with other datasets, to experiment and to explore in order to generate new insights. Exploration means using data for unforeseen purposes, which is the opposite of the “specified and explicit” purposes for which the user gave their consent (if we can meaningfully speak of consent at all [109]). Updated regulations are now being developed [89].

Companies that collect lots of data about people oppose regulation as being a burden and a hindrance to innovation. To some extent that opposition is justified. For example, when sharing medical data, there are clear risks to privacy, but there are also potential opportunities: how many deaths could be prevented if data analysis was able to help us achieve better diagnostics or find better treatments [110]? Over-regulation may prevent such breakthroughs. It is difficult to balance such potential opportunities with the risks [105].

Fundamentally, I think we need a culture shift in the tech industry with regard to personal data. We should stop regarding users as metrics to be optimized, and remember that they are humans who deserve respect, dignity, and agency. We should self-regulate our data collection and processing practices in order to establish and maintain the trust of the people who depend on our software [111]. And we should take it upon ourselves to educate end users about how their data is used, rather than keeping them in the dark.

We should allow each individual to maintain their privacy—i.e., their control over their own data—and not steal that control from them through surveillance. Our individual right to control our data is like the natural environment of a national park: if we don’t explicitly protect and care for it, it will be destroyed. It will be the tragedy of the commons, and we will all be worse off for it. Ubiquitous surveillance is not inevitable—we are still able to stop it.

How exactly we might achieve this is an open question. To begin with, we should not retain data forever, but purge it as soon as it is no longer needed [111, 112]. Purging data runs counter to the idea of immutability (see “Limitations of immutability”), but that issue can be solved. A promising approach I see is to enforce access control through cryptographic protocols, rather than merely by policy [113, 114]. Overall, culture and attitude changes will be necessary.
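One way to reconcile purging with immutable or widely replicated data is "crypto-shredding": encrypt each person's data under a per-person key, and purge by destroying the key, so that any remaining copies become unreadable. The sketch below illustrates only the idea; the XOR keystream is a toy stand-in for a real cipher (such as AES-GCM), and the names are hypothetical, not from any particular system.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR the data with a SHA-256-derived keystream.
    Applying it twice with the same key restores the plaintext.
    Illustration only -- not a secure construction."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

# Each user's data is stored only in encrypted form, under a per-user key.
user_keys = {"alice": secrets.token_bytes(32)}
records = {"alice": keystream_xor(user_keys["alice"], b"alice's sensitive record")}

# While the key exists, the data can be read back:
assert keystream_xor(user_keys["alice"], records["alice"]) == b"alice's sensitive record"

# Purging = deleting the key. The ciphertext that lingers in backups,
# logs, and derived datasets can no longer be decrypted.
del user_keys["alice"]
```

Destroying one small key is far easier to do reliably than finding and erasing every copy of the data itself, which is why this approach fits an architecture built on immutable logs.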

Summary

In this chapter we discussed new approaches to designing data systems, and I included my personal opinions and speculations about the future. We started with the observation that there is no one single tool that can efficiently serve all possible use cases, and so applications necessarily need to compose several different pieces of software to accomplish their goals. We discussed how to solve this data integration problem by using batch processing and event streams to let data changes flow between different systems.

In this approach, certain systems are designated as systems of record, and other data is derived from them through transformations. In this way we can maintain indexes, materialized views, machine learning models, statistical summaries, and more. By making these derivations and transformations asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated parts of the system, increasing the robustness and fault-tolerance of the system as a whole.
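As a minimal sketch of this idea, a derived view is just the system of record's change log folded through a transformation function. The event schema and field names below are hypothetical illustrations, not an API from this book:

```python
# System of record: an append-only log of change events.
change_log = [
    {"op": "put", "user": "alice", "email": "alice@example.com"},
    {"op": "put", "user": "bob",   "email": "bob@example.com"},
    {"op": "put", "user": "alice", "email": "alice@new.example.com"},
    {"op": "del", "user": "bob"},
]

def apply_event(view, event):
    """Apply one change event to a derived key-value view."""
    if event["op"] == "put":
        view[event["user"]] = event["email"]
    elif event["op"] == "del":
        view.pop(event["user"], None)

# Derived state: the log folded through the transformation, in order.
view = {}
for event in change_log:
    apply_event(view, event)

print(view)  # {'alice': 'alice@new.example.com'}
```

A consumer that applies events as they arrive maintains the view incrementally; because the consumer is asynchronous, a slow or failed view-maintainer delays only its own view rather than blocking writes to the system of record.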

Expressing dataflows as transformations from one dataset to another also helps evolve applications: if you want to change one of the processing steps, for example to change the structure of an index or cache, you can just rerun the new transformation code on the whole input dataset in order to rederive the output. Similarly, if something goes wrong, you can fix the code and reprocess the data in order to recover.
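To make this concrete, here is a minimal sketch (all names hypothetical) of rederiving a derived dataset by rerunning transformation code over the full input. The `build_index` functions stand in for any derivation, such as a search index or a cache:

```python
# A minimal sketch of rederiving output: the index is a pure function of the
# input dataset, so changing its structure just means rerunning new code.
def build_index(documents):
    """Derive a word -> set-of-doc-ids index from the input dataset."""
    index = {}
    for doc_id, text in documents:
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index

def build_index_v2(documents):
    """A changed derivation: word -> number of documents containing it."""
    return {word: len(ids) for word, ids in build_index(documents).items()}

dataset = [(1, "hello world"), (2, "hello dataflow")]
index_v1 = build_index(dataset)      # original derived state
index_v2 = build_index_v2(dataset)   # rerun new code on the same input
```

Because the input is never modified, both versions of the derived data can even be built side by side while migrating consumers from one to the other.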

这些过程与数据库内部已经执行的操作非常相似,因此我们将数据流应用程序的想法重新定义为分拆数据库的组件,并通过组合这些松散耦合的组件来构建应用程序。

These processes are quite similar to what databases already do internally, so we recast the idea of dataflow applications as unbundling the components of a database, and building an application by composing these loosely coupled components.

可以通过观察基础数据的变化来更新派生状态。而且,派生状态本身可以进一步被下游消费者观察到。我们甚至可以将此数据流一直传送到显示数据的最终用户设备,从而构建动态更新以反映数据更改并继续离线工作的用户界面。

Derived state can be updated by observing changes in the underlying data. Moreover, the derived state itself can further be observed by downstream consumers. We can even take this dataflow all the way through to the end-user device that is displaying the data, and thus build user interfaces that dynamically update to reflect data changes and continue to work offline.
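As a rough illustration of that chain of observation, the following sketch (hypothetical names, not any particular framework) keeps a derived counter up to date by consuming change events, and in turn lets downstream consumers subscribe to the derived state:

```python
# Derived state maintained by observing changes, and itself observable:
# consumers subscribe to the derived counter just as it subscribes to the log.
class DerivedCounter:
    def __init__(self):
        self.counts = {}       # the derived state
        self.subscribers = []  # downstream observers of the derived state

    def on_change(self, key):
        """Consume one change event from the underlying data."""
        self.counts[key] = self.counts.get(key, 0) + 1
        for notify in self.subscribers:  # derived state is itself a stream
            notify(key, self.counts[key])

seen = []
counter = DerivedCounter()
counter.subscribers.append(lambda k, n: seen.append((k, n)))
for event in ["a", "b", "a"]:
    counter.on_change(event)
# seen == [("a", 1), ("b", 1), ("a", 2)]
```

The same shape extends to the end-user device: a UI is just one more subscriber at the end of the chain.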

接下来,我们讨论了如何确保所有这些处理在出现故障时仍然保持正确。我们看到,通过使用端到端操作标识符使操作幂等,并异步地检查约束,可以利用异步事件处理可扩展地实现强完整性保证。客户端可以等到检查通过再继续,也可以不等待就继续操作,但要承担事后为违反约束而道歉的风险。这种方法比使用分布式事务的传统方法更具可扩展性和健壮性,并且符合实践中许多业务流程的实际运作方式。

Next, we discussed how to ensure that all of this processing remains correct in the presence of faults. We saw that strong integrity guarantees can be implemented scalably with asynchronous event processing, by using end-to-end operation identifiers to make operations idempotent and by checking constraints asynchronously. Clients can either wait until the check has passed, or go ahead without waiting but risk having to apologize about a constraint violation. This approach is much more scalable and robust than the traditional approach of using distributed transactions, and fits with how many business processes work in practice.
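The idempotence half of that idea can be sketched in a few lines, under the assumption that every client request carries an end-to-end operation ID (all names here are hypothetical; a real system would keep the set of seen IDs in durable storage alongside the data):

```python
# Duplicate suppression via end-to-end operation identifiers: retrying a
# request with the same ID has no further effect, so at-least-once delivery
# of the request message is safe.
processed_ids = set()        # in practice: durable, stored with the data
balances = {"alice": 100}

def transfer(op_id, account, amount):
    """Apply a balance change at most once per operation ID."""
    if op_id in processed_ids:   # duplicate delivery or client retry: ignore
        return balances[account]
    processed_ids.add(op_id)
    balances[account] += amount
    return balances[account]

transfer("op-1", "alice", -30)
transfer("op-1", "alice", -30)   # the retry is a no-op
```

The crucial point is that the ID is generated end to end, by the client, so the suppression covers retries at every layer in between.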

通过围绕数据流构建应用程序并异步检查约束,我们可以避免大多数协调并创建保持完整性但仍然性能良好的系统,即使在地理分布的场景和存在故障的情况下也是如此。然后我们讨论了如何使用审计来验证数据的完整性并检测损坏。

By structuring applications around dataflow and checking constraints asynchronously, we can avoid most coordination and create systems that maintain integrity but still perform well, even in geographically distributed scenarios and in the presence of faults. We then talked a little about using audits to verify the integrity of data and detect corruption.
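One simple form such an audit can take is a hash over the event log: recomputing it and comparing against a stored checkpoint detects later corruption. A sketch (not the format of any particular system):

```python
import hashlib

def chain_hash(events):
    """Fold a SHA-256 hash chain over an ordered log of events."""
    digest = b"\x00" * 32  # fixed initial value
    for event in events:
        digest = hashlib.sha256(digest + event.encode("utf-8")).digest()
    return digest.hex()

log = ["credit:10", "debit:3"]
checkpoint = chain_hash(log)       # stored at audit time

# Later: recompute and compare. Any silent change to the log is detected.
tampered = ["credit:10", "debit:30"]
detected = chain_hash(tampered) != checkpoint
```

Because each link covers the previous digest, the checkpoint commits to the entire ordered history, not just the individual events.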

最后,我们退后一步,审视了构建数据密集型应用程序的一些道德问题。我们看到,虽然数据可以用来做好事,但它也可能造成重大危害:为严重影响人们生活且难以申诉的决策辩护,导致歧视和剥削,使监视常态化,以及暴露私密信息。我们还面临数据泄露的风险,并且可能发现善意的数据使用会产生意想不到的后果。

Finally, we took a step back and examined some ethical aspects of building data-intensive applications. We saw that although data can be used to do good, it can also do significant harm: justifying decisions that seriously affect people’s lives and are difficult to appeal against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate information. We also run the risk of data breaches, and we may find that a well-intentioned use of data has unintended consequences.

由于软件和数据对世界产生如此巨大的影响,我们工程师必须记住,我们有责任为我们想要生活的世界而努力:一个以人性和尊重对待人们的世界。我希望我们能够共同努力实现这一目标。

As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal.

脚注

i 解释一个笑话很少能让它更好笑,但我不想让任何人感到被排除在外。这里的 Church 指的是数学家 Alonzo Church,他创造了 lambda 演算,这是一种早期的计算形式,也是大多数函数式编程语言的基础。lambda 演算没有可变状态(即没有可以被覆盖的变量),因此可以说可变状态与 Church 的工作是分离的。

i Explaining a joke rarely improves it, but I don’t want anyone to feel left out. Here, Church is a reference to the mathematician Alonzo Church, who created the lambda calculus, an early form of computation that is the basis for most functional programming languages. The lambda calculus has no mutable state (i.e., no variables that can be overwritten), so one could say that mutable state is separate from Church’s work.

ii 在微服务方法中,您可以通过在处理购买的服务中本地缓存汇率来避免同步网络请求。但是,为了保持缓存的新鲜,您需要定期轮询更新的汇率,或者订阅一个变更流,而这正是数据流方法中所发生的事情。

ii In the microservices approach, you could avoid the synchronous network request by caching the exchange rate locally in the service that processes the purchase. However, in order to keep that cache fresh, you would need to periodically poll for updated exchange rates, or subscribe to a stream of changes—which is exactly what happens in the dataflow approach.

iii 不开玩笑地说:假设语料库是有限的,那么具有非空搜索结果的不同搜索查询的集合也是有限的。然而,其数量会随语料库中词项的数量呈指数增长,这仍然是相当坏的消息。

iii Less facetiously, the set of distinct search queries with nonempty search results is finite, assuming a finite corpus. However, it would be exponential in the number of terms in the corpus, which is still pretty bad news.

参考

[ 1 ] Rachid Belaid:“ Postgres 全文搜索已经足够好了!”,rachbelaid.com,2015 年 7 月 13 日。

[1] Rachid Belaid: “Postgres Full-Text Search is Good Enough!,” rachbelaid.com, July 13, 2015.

[ 2 ] Philippe Ajoux、Nathan Bronson、Sanjeev Kumar 等人:“大规模采用更强一致性的挑战”,第 15 届 USENIX 操作系统热门主题研讨会(HotOS),2015 年 5 月。

[2] Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “Challenges to Adopting Stronger Consistency at Scale,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

[ 3 ]Pat Helland 和 Dave Campbell:“ Building on Quicksand ”, 第四届创新数据系统研究双年度会议(CIDR),2009 年 1 月。

[3] Pat Helland and Dave Campbell: “Building on Quicksand,” at 4th Biennial Conference on Innovative Data Systems Research (CIDR), January 2009.

[ 4 ] Jessica Kerr:“分布式系统中的起源和因果关系”,blog.jessitron.com,2016 年 9 月 25 日。

[4] Jessica Kerr: “Provenance and Causality in Distributed Systems,” blog.jessitron.com, September 25, 2016.

[ 5 ] Kostas Tzoumas:“批处理是流处理的一个特例”,data-artisans.com,2015 年 9 月 15 日。

[5] Kostas Tzoumas: “Batch Is a Special Case of Streaming,” data-artisans.com, September 15, 2015.

[ 6 ] Shinji Kim 和 Robert Blafford:“流窗口性能分析:Concord 和 Spark Streaming ”,concord.io,2016 年 7 月 6 日。

[6] Shinji Kim and Robert Blafford: “Stream Windowing Performance Analysis: Concord and Spark Streaming,” concord.io, July 6, 2016.

[ 7 ] Jay Kreps:“日志:每个软件工程师都应该了解实时数据的统一抽象”, engineering.linkedin.com,2013 年 12 月 16 日。

[7] Jay Kreps: “The Log: What Every Software Engineer Should Know About Real-Time Data’s Unifying Abstraction,” engineering.linkedin.com, December 16, 2013.

[ 8 ] Pat Helland:“超越分布式事务的生活:叛教者的观点”,第三届创新数据系统研究双年度会议(CIDR),2007 年 1 月。

[8] Pat Helland: “Life Beyond Distributed Transactions: An Apostate’s Opinion,” at 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 2007.

[ 9 ]“大西部铁路(1835-1948) ”,网络铁路虚拟档案,networkrail.co.uk

[9] “Great Western Railway (1835–1948),” Network Rail Virtual Archive, networkrail.co.uk.

[ 10 ] Jacqueline Xu:“大规模在线迁移”, stripe.com,2017 年 2 月 2 日。

[10] Jacqueline Xu: “Online Migrations at Scale,” stripe.com, February 2, 2017.

[ 11 ] Molly Bartlett Dishman 和 Martin Fowler:“敏捷架构”,O'Reilly 软件架构会议,2015 年 3 月。

[11] Molly Bartlett Dishman and Martin Fowler: “Agile Architecture,” at O’Reilly Software Architecture Conference, March 2015.

[ 12 ] Nathan Marz 和 James Warren: 大数据:可扩展实时数据系统的原理和最佳实践。曼宁,2015。ISBN:978-1-617-29034-3

[12] Nathan Marz and James Warren: Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Manning, 2015. ISBN: 978-1-617-29034-3

[ 13 ] Oscar Boykin、Sam Ritchie、Ian O'Connell 和 Jimmy Lin:“ Summingbird:集成批处理和在线 MapReduce 计算的框架”,第40 届国际超大型数据库会议(VLDB),2014 年 9 月。

[13] Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin: “Summingbird: A Framework for Integrating Batch and Online MapReduce Computations,” at 40th International Conference on Very Large Data Bases (VLDB), September 2014.

[ 14 ] Jay Kreps:“质疑 Lambda 架构”,oreilly.com,2014 年 7 月 2 日。

[14] Jay Kreps: “Questioning the Lambda Architecture,” oreilly.com, July 2, 2014.

[ 15 ] Raul Castro Fernandez、Peter Pietzuch、Jay Kreps 等人:“ Liquid:统一近线和离线大数据集成”,第七届创新数据系统研究双年度会议(CIDR),2015 年 1 月。

[15] Raul Castro Fernandez, Peter Pietzuch, Jay Kreps, et al.: “Liquid: Unifying Nearline and Offline Big Data Integration,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.

[ 16 ] Dennis M. Ritchie 和 Ken Thompson:“ The UNIX Time-Sharing System ”,Communications of the ACM,第 17 卷,第 7 期,第 365–375 页,1974 年 7 月 。doi:10.1145/361011.361061

[16] Dennis M. Ritchie and Ken Thompson: “The UNIX Time-Sharing System,” Communications of the ACM, volume 17, number 7, pages 365–375, July 1974. doi:10.1145/361011.361061

[ 17 ] Eric A. Brewer 和 Joseph M. Hellerstein:“ CS262a:计算机系统高级主题”,讲座笔记,加州大学伯克利分校,cs.berkeley.edu,2011 年 8 月。

[17] Eric A. Brewer and Joseph M. Hellerstein: “CS262a: Advanced Topics in Computer Systems,” lecture notes, University of California, Berkeley, cs.berkeley.edu, August 2011.

[ 18 ] Michael Stonebraker:“ Polystores 案例”,wp.sigmod.org,2015 年 7 月 13 日。

[18] Michael Stonebraker: “The Case for Polystores,” wp.sigmod.org, July 13, 2015.

[ 19 ] Jennie Duggan、Aaron J. Elmore、Michael Stonebraker 等人:“ The BigDAWG Polystore System ”,ACM SIGMOD Record,第 44 卷,第 2 期,第 11-16 页,2015 年 6 月 。doi:10.1145/2814710.2814713

[19] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, et al.: “The BigDAWG Polystore System,” ACM SIGMOD Record, volume 44, number 2, pages 11–16, June 2015. doi:10.1145/2814710.2814713

[ 20 ]Patrycja Dybka:“ PostgreSQL 的外部数据包装器”,vertabelo.com,2015 年 3 月 24 日。

[20] Patrycja Dybka: “Foreign Data Wrappers for PostgreSQL,” vertabelo.com, March 24, 2015.

[ 21 ] David B. Lomet、Alan Fekete、Gerhard Weikum 和 Mike Zwilling:“在云中分拆事务服务”,第四届创新数据系统研究双年度会议(CIDR),2009 年 1 月。

[21] David B. Lomet, Alan Fekete, Gerhard Weikum, and Mike Zwilling: “Unbundling Transaction Services in the Cloud,” at 4th Biennial Conference on Innovative Data Systems Research (CIDR), January 2009.

[ 22 ] Martin Kleppmann 和 Jay Kreps:“ Kafka、Samza 和 Unix 分布式数据哲学”,IEEE 数据工程公告,第 38 卷,第 4 期,第 4-14 页,2015 年 12 月。

[22] Martin Kleppmann and Jay Kreps: “Kafka, Samza and the Unix Philosophy of Distributed Data,” IEEE Data Engineering Bulletin, volume 38, number 4, pages 4–14, December 2015.

[ 23 ] John Hugg:“赢得现在和未来:VoltDB 的闪光点”,voltdb.com,2016 年 3 月 23 日。

[23] John Hugg: “Winning Now and in the Future: Where VoltDB Shines,” voltdb.com, March 23, 2016.

[ 24 ] Frank McSherry、Derek G. Murray、Rebecca Isaacs 和 Michael Isard:“差异化数据流”,第六届创新数据系统研究双年度会议(CIDR),2013 年 1 月。

[24] Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard: “Differential Dataflow,” at 6th Biennial Conference on Innovative Data Systems Research (CIDR), January 2013.

[ 25 ] Derek G Murray、Frank McSherry、Rebecca Isaacs 等人:“ Naiad:及时数据流系统”,第 24 届 ACM 操作系统原理研讨会(SOSP),第 439-455 页,2013 年 11 月。doi:10.1145/2517349.2522738

[25] Derek G Murray, Frank McSherry, Rebecca Isaacs, et al.: “Naiad: A Timely Dataflow System,” at 24th ACM Symposium on Operating Systems Principles (SOSP), pages 439–455, November 2013. doi:10.1145/2517349.2522738

[ 26 ] Gwen Shapira:“我们有一批客户正在实现‘数据库由内而外’的概念,他们都在问‘还有其他人在这样做吗?我们疯了吗?’”,twitter.com,2016 年 7 月 28 日。

[26] Gwen Shapira: “We have a bunch of customers who are implementing ‘database inside-out’ concept and they all ask ‘is anyone else doing it? are we crazy?’” twitter.com, July 28, 2016.

[ 27 ] Martin Kleppmann:“用 Apache Samza 将数据库彻底翻转” , Strange Loop,2014 年 9 月。

[27] Martin Kleppmann: “Turning the Database Inside-out with Apache Samza,” at Strange Loop, September 2014.

[ 28 ] Peter Van Roy 和 Seif Haridi: 计算机编程的概念、技术和模型。麻省理工学院出版社,2004 年。ISBN:978-0-262-22069-9

[28] Peter Van Roy and Seif Haridi: Concepts, Techniques, and Models of Computer Programming. MIT Press, 2004. ISBN: 978-0-262-22069-9

[ 29 ]“ Juttle 文档”,juttle.github.io,2016 年。

[29] “Juttle Documentation,” juttle.github.io, 2016.

[ 30 ] Evan Czaplicki 和 Stephen Chong:“ GUI 的异步功能响应式编程”,第 34 届 ACM SIGPLAN 编程语言设计与实现(PLDI) 会议,2013 年 6 月 。doi:10.1145/2491956.2462161

[30] Evan Czaplicki and Stephen Chong: “Asynchronous Functional Reactive Programming for GUIs,” at 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2013. doi:10.1145/2491956.2462161

[ 31 ] Engineer Bainomugisha、Andoni Lombide Carreton、Tom van Cutsem、Stijn Mostinckx 和 Wolfgang de Meuter:“响应式编程调查”,ACM 计算调查,第 45 卷,第 4 期,第 1-34 页,2013 年 8 月。doi:10.1145/2501654.2501666

[31] Engineer Bainomugisha, Andoni Lombide Carreton, Tom van Cutsem, Stijn Mostinckx, and Wolfgang de Meuter: “A Survey on Reactive Programming,” ACM Computing Surveys, volume 45, number 4, pages 1–34, August 2013. doi:10.1145/2501654.2501666

[ 32 ] Peter Alvaro、Neil Conway、Joseph M. Hellerstein 和 William R. Marczak:“ Bloom 中的一致性分析:一种平静和收集的方法”,第五届创新数据系统研究双年会 (CIDR),2011 年 1 月。

[32] Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and William R. Marczak: “Consistency Analysis in Bloom: A CALM and Collected Approach,” at 5th Biennial Conference on Innovative Data Systems Research (CIDR), January 2011.

[ 33 ] Felienne Hermans:“电子表格就是代码”,Code Mesh,2015 年 11 月。

[33] Felienne Hermans: “Spreadsheets Are Code,” at Code Mesh, November 2015.

[ 34 ] Dan Bricklin 和 Bob Frankston:“ VisiCalc:来自其创建者的信息”,danbricklin.com

[34] Dan Bricklin and Bob Frankston: “VisiCalc: Information from Its Creators,” danbricklin.com.

[ 35 ] D. Sculley、Gary Holt、Daniel Golovin 等人:“机器学习:技术债务的高息信用卡”,NIPS 机器学习软件工程研讨会 (SE4ML),2014 年 12 月。

[35] D. Sculley, Gary Holt, Daniel Golovin, et al.: “Machine Learning: The High-Interest Credit Card of Technical Debt,” at NIPS Workshop on Software Engineering for Machine Learning (SE4ML), December 2014.

[ 36 ] Peter Bailis、Alan Fekete、Michael J Franklin 等人:“ Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity ”,ACM 国际数据管理会议(SIGMOD),2015 年 6 月。doi:10.1145/2723372.2737784

[36] Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2737784

[ 37 ] Guy Steele:“回复:需要宏(以前是回复:图标) ”,发送至ll1-discuss邮件列表的电子邮件,people.csail.mit.edu,2001 年 12 月 24 日。

[37] Guy Steele: “Re: Need for Macros (Was Re: Icon),” email to ll1-discuss mailing list, people.csail.mit.edu, December 24, 2001.

[ 38 ] David Gelernter:“ Linda 中的生成式通信”,ACM 编程语言和系统汇刊 (TOPLAS),第 7 卷,第 1 期,第 80–112 页,1985 年 1 月 。doi:10.1145/2363.2433

[38] David Gelernter: “Generative Communication in Linda,” ACM Transactions on Programming Languages and Systems (TOPLAS), volume 7, number 1, pages 80–112, January 1985. doi:10.1145/2363.2433

[ 39 ] Patrick Th. Eugster、Pascal A. Felber、Rachid Guerraoui 和 Anne-Marie Kermarrec:“发布/订阅的诸多方面”,ACM 计算调查,第 35 卷,第 2 期,第 114–131 页,2003 年 6 月。doi:10.1145/857076.857078

[39] Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “The Many Faces of Publish/Subscribe,” ACM Computing Surveys, volume 35, number 2, pages 114–131, June 2003. doi:10.1145/857076.857078

[ 40 ] Ben Stopford:“流媒体世界中的微服务”,伦敦 QCon,2016 年 3 月。

[40] Ben Stopford: “Microservices in a Streaming World,” at QCon London, March 2016.

[ 41 ] Christian Posta:“为什么微服务应该由事件驱动:自治与权威”,blog.christianposta.com,2016 年 5 月 27 日。

[41] Christian Posta: “Why Microservices Should Be Event Driven: Autonomy vs Authority,” blog.christianposta.com, May 27, 2016.

[ 42 ] Alex Feyerke:“首先向离线问好”, hood.ie,2013 年 11 月 5 日。

[42] Alex Feyerke: “Say Hello to Offline First,” hood.ie, November 5, 2013.

[ 43 ] Sebastian Burckhardt、Daan Leijen、Jonathan Protzenko 和 Manuel Fähndrich:“全局序列协议:复制共享状态的鲁棒抽象”,第 29 届欧洲面向对象编程会议(ECOOP),2015 年 7 月。doi:10.4230/LIPIcs.ECOOP.2015.568

[43] Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich: “Global Sequence Protocol: A Robust Abstraction for Replicated Shared State,” at 29th European Conference on Object-Oriented Programming (ECOOP), July 2015. doi:10.4230/LIPIcs.ECOOP.2015.568

[ 44 ]Mark Soper:“用 Flux、Redux 和 Relay 消除 React 数据管理混乱”,medium.com,2015 年 12 月 3 日。

[44] Mark Soper: “Clearing Up React Data Management Confusion with Flux, Redux, and Relay,” medium.com, December 3, 2015.

[ 45 ] Eno Thereska、Damian Guy、Michael Noll 和 Neha Narkhede:“在 Apache Kafka 中统一流处理和交互式查询”,confluence.io,2016 年 10 月 26 日。

[45] Eno Thereska, Damian Guy, Michael Noll, and Neha Narkhede: “Unifying Stream Processing and Interactive Queries in Apache Kafka,” confluent.io, October 26, 2016.

[ 46 ] Frank McSherry:“数据流作为数据库”,github.com,2016 年 7 月 17 日。

[46] Frank McSherry: “Dataflow as Database,” github.com, July 17, 2016.

[ 47 ] Peter Alvaro:“我明白你的意思”,Strange Loop,2015 年 9 月。

[47] Peter Alvaro: “I See What You Mean,” at Strange Loop, September 2015.

[ 48 ] Nathan Marz:“ Trident:实时计算的高级抽象”,blog.twitter.com,2012 年 8 月 2 日。

[48] Nathan Marz: “Trident: A High-Level Abstraction for Realtime Computation,” blog.twitter.com, August 2, 2012.

[ 49 ] Edi Bice:“利用 Apache Samza、Kafka 和朋友进行低延迟网络规模欺诈预防”,商户风险委员会 MRC 维加斯会议,2016 年 3 月。

[49] Edi Bice: “Low Latency Web Scale Fraud Prevention with Apache Samza, Kafka and Friends,” at Merchant Risk Council MRC Vegas Conference, March 2016.

[ 50 ] Charity Majors:“意外的 DBA”,charity.wtf,2016 年 10 月 2 日。

[50] Charity Majors: “The Accidental DBA,” charity.wtf, October 2, 2016.

[ 51 ] Arthur J. Bernstein、Philip M. Lewis 和 Shiyong Lu:“不同隔离级别正确性的语义条件”,第 16 届国际数据工程会议(ICDE),2000 年 2 月 。doi:10.1109/ICDE.2000.839387

[51] Arthur J. Bernstein, Philip M. Lewis, and Shiyong Lu: “Semantic Conditions for Correctness at Different Isolation Levels,” at 16th International Conference on Data Engineering (ICDE), February 2000. doi:10.1109/ICDE.2000.839387

[ 52 ] Sudhir Jorwekar、Alan Fekete、Krithi Ramamritham 和 S. Sudarshan:“自动检测快照隔离异常”,第 33 届超大型数据库国际会议(VLDB),2007 年 9 月。

[52] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “Automating the Detection of Snapshot Isolation Anomalies,” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.

[ 53 ] 凯尔·金斯伯里:Jepsen 博客文章系列,aphyr.com,2013–2016。

[53] Kyle Kingsbury: Jepsen blog post series, aphyr.com, 2013–2016.

[ 54 ] Michael Jouravlev:“发布后重定向”, theserverside.com,2004 年 8 月 1 日。

[54] Michael Jouravlev: “Redirect After Post,” theserverside.com, August 1, 2004.

[ 55 ] Jerome H. Saltzer、David P. Reed 和 David D. Clark:“系统设计中的端到端论证”,ACM Transactions on Computer Systems,第 2 卷,第 4 期,第 277–288 页,1984 年 11 月。doi:10.1145/357401.357402

[55] Jerome H. Saltzer, David P. Reed, and David D. Clark: “End-to-End Arguments in System Design,” ACM Transactions on Computer Systems, volume 2, number 4, pages 277–288, November 1984. doi:10.1145/357401.357402

[ 56 ] Peter Bailis、Alan Fekete、Michael J. Franklin 等人:“避免协调的数据库系统”, VLDB Endowment 论文集,第 8 卷,第 3 期,第 185-196 页,2014 年 11 月。

[56] Peter Bailis, Alan Fekete, Michael J. Franklin, et al.: “Coordination-Avoiding Database Systems,” Proceedings of the VLDB Endowment, volume 8, number 3, pages 185–196, November 2014.

[ 57 ] Alex Yarmula:“曼哈顿的强一致性”,blog.twitter.com,2016 年 3 月 17 日。

[57] Alex Yarmula: “Strong Consistency in Manhattan,” blog.twitter.com, March 17, 2016.

[ 58 ] Douglas B Terry、Marvin M Theimer、Karin Petersen 等人:“管理 Bayou(弱连接复制存储系统)中的更新冲突”,第 15 届 ACM 操作系统原理研讨会(SOSP),第 172-182 页, 1995 年 12 月 。doi:10.1145/224056.224070

[58] Douglas B Terry, Marvin M Theimer, Karin Petersen, et al.: “Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System,” at 15th ACM Symposium on Operating Systems Principles (SOSP), pages 172–182, December 1995. doi:10.1145/224056.224070

[ 59 ] Jim Gray:“事务概念:优点和局限性”,第 7 届超大型数据库国际会议(VLDB),1981 年 9 月。

[59] Jim Gray: “The Transaction Concept: Virtues and Limitations,” at 7th International Conference on Very Large Data Bases (VLDB), September 1981.

[ 60 ] Hector Garcia-Molina 和 Kenneth Salem:“ Sagas ”, ACM 国际数据管理会议(SIGMOD),1987 年 5 月 。doi:10.1145/38713.38742

[60] Hector Garcia-Molina and Kenneth Salem: “Sagas,” at ACM International Conference on Management of Data (SIGMOD), May 1987. doi:10.1145/38713.38742

[ 61 ] Pat Helland:“回忆、猜测和道歉”,blogs.msdn.com,2007 年 5 月 15 日。

[61] Pat Helland: “Memories, Guesses, and Apologies,” blogs.msdn.com, May 15, 2007.

[ 62 ] Yoongu Kim、Ross Daly、Jeremie Kim 等人:“ Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors ”,第 41 届计算机体系结构国际研讨会(ISCA),2014 年 6 月。doi:10.1145/2678373.2665726

[62] Yoongu Kim, Ross Daly, Jeremie Kim, et al.: “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” at 41st Annual International Symposium on Computer Architecture (ISCA), June 2014. doi:10.1145/2678373.2665726

[ 63 ] Mark Seaborn 和 Thomas Dullien:“利用 DRAM Rowhammer 错误来获取内核权限”,googleprojectzero.blogspot.co.uk,2015 年 3 月 9 日。

[63] Mark Seaborn and Thomas Dullien: “Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges,” googleprojectzero.blogspot.co.uk, March 9, 2015.

[ 64 ] Jim N. Gray 和 Catharine van Ingen:“磁盘故障率和错误率的经验测量”,微软研究院,MSR-TR-2005-166,2005 年 12 月。

[64] Jim N. Gray and Catharine van Ingen: “Empirical Measurements of Disk Failure Rates and Error Rates,” Microsoft Research, MSR-TR-2005-166, December 2005.

[ 65 ] Annamalai Gurusami 和 Daniel Price:“ Bug #73170:由于 Bug#68021 的修复,唯一二级索引出现重复” , bugs.mysql.com,2014 年 7 月。

[65] Annamalai Gurusami and Daniel Price: “Bug #73170: Duplicates in Unique Secondary Index Because of Fix of Bug#68021,” bugs.mysql.com, July 2014.

[ 66 ] Gary Fredericks:“ Postgres 可序列化性错误”, github.com,2015 年 9 月。

[66] Gary Fredericks: “Postgres Serializability Bug,” github.com, September 2015.

[ 67 ] 陈晓:“ HDFS DataNode Scanners 和 Disk Checker 详解”,blog.cloudera.com,2016 年 12 月 20 日。

[67] Xiao Chen: “HDFS DataNode Scanners and Disk Checker Explained,” blog.cloudera.com, December 20, 2016.

[ 68 ] Jay Kreps:“真正了解分布式系统可靠性”,blog.empathybox.com,2012 年 3 月 19 日。

[68] Jay Kreps: “Getting Real About Distributed System Reliability,” blog.empathybox.com, March 19, 2012.

[ 69 ] Martin Fowler:“ LMAX 架构”, martinfowler.com,2011 年 7 月 12 日。

[69] Martin Fowler: “The LMAX Architecture,” martinfowler.com, July 12, 2011.

[ 70 ]萨姆·斯托克斯:“充满信心地快速行动”,blog.samstokes.co.uk,2016 年 7 月 11 日。

[70] Sam Stokes: “Move Fast with Confidence,” blog.samstokes.co.uk, July 11, 2016.

[ 71 ]“ Sawtooth Lake 文档”,英特尔公司,intelledger.github.io,2016 年。

[71] “Sawtooth Lake Documentation,” Intel Corporation, intelledger.github.io, 2016.

[ 72 ]理查德·根达尔·布朗:“ R3 Corda™简介:专为金融服务设计的分布式账本”,gendal.me,2016 年 4 月 5 日。

[72] Richard Gendal Brown: “Introducing R3 Corda™: A Distributed Ledger Designed for Financial Services,” gendal.me, April 5, 2016.

[ 73 ] Trent McConaghy、Rodolphe Marques、Andreas Müller 等人:“ BigchainDB:可扩展的区块链数据库”,bigchaindb.com,2016 年 6 月 8 日。

[73] Trent McConaghy, Rodolphe Marques, Andreas Müller, et al.: “BigchainDB: A Scalable Blockchain Database,” bigchaindb.com, June 8, 2016.

[ 74 ] Ralph C. Merkle:“基于传统加密函数的数字签名”,CRYPTO '87,1987 年 8 月 。doi:10.1007/3-540-48184-2_32

[74] Ralph C. Merkle: “A Digital Signature Based on a Conventional Encryption Function,” at CRYPTO ’87, August 1987. doi:10.1007/3-540-48184-2_32

[ 75 ] Ben Laurie:“证书透明度”,ACM 队列,第 12 卷,第 8 期,第 10-19 页,2014 年 8 月 。doi:10.1145/2668152.2668154

[75] Ben Laurie: “Certificate Transparency,” ACM Queue, volume 12, number 8, pages 10-19, August 2014. doi:10.1145/2668152.2668154

[ 76 ] Mark D. Ryan:“增强的证书透明度和端到端加密邮件”,网络和分布式系统安全研讨会(NDSS),2014 年 2 月 。doi:10.14722/ndss.2014.23379

[76] Mark D. Ryan: “Enhanced Certificate Transparency and End-to-End Encrypted Mail,” at Network and Distributed System Security Symposium (NDSS), February 2014. doi:10.14722/ndss.2014.23379

[ 77 ]“软件工程道德和专业实践准则”,计算机协会, acm.org,1999。

[77] “Software Engineering Code of Ethics and Professional Practice,” Association for Computing Machinery, acm.org, 1999.

[ 78 ] François Chollet:“软件开发开始涉及重要的道德选择”,twitter.com,2016 年 10 月 30 日。

[78] François Chollet: “Software development is starting to involve important ethical choices,” twitter.com, October 30, 2016.

[ 79 ] Igor Perisic:“做出艰难的选择:机器学习中的道德追求”,engineering.linkedin.com,2016 年 11 月。

[79] Igor Perisic: “Making Hard Choices: The Quest for Ethics in Machine Learning,” engineering.linkedin.com, November 2016.

[ 80 ]约翰·诺顿:“算法编写者需要行为准则,” theguardian.com,2015 年 12 月 6 日。

[80] John Naughton: “Algorithm Writers Need a Code of Conduct,” theguardian.com, December 6, 2015.

[ 81 ] Logan Kugler:“当大数据出错时会发生什么?”,《ACM 通讯》,第 59 卷,第 6 期,第 15-16 页,2016 年 6 月。doi:10.1145/2911975

[81] Logan Kugler: “What Happens When Big Data Blunders?,” Communications of the ACM, volume 59, number 6, pages 15–16, June 2016. doi:10.1145/2911975

[ 82 ] 比尔·戴维多:“欢迎来到算法监狱”,theatlantic.com,2014 年 2 月 20 日。

[82] Bill Davidow: “Welcome to Algorithmic Prison,” theatlantic.com, February 20, 2014.

[ 83 ]Don Peck:“他们在工作中看着你”,theatlantic.com,2013 年 12 月。

[83] Don Peck: “They’re Watching You at Work,” theatlantic.com, December 2013.

[ 84 ] Leigh Alexander:“算法的种族主义比人类更少吗?”,theguardian.com,2016 年 8 月 3 日。

[84] Leigh Alexander: “Is an Algorithm Any Less Racist Than a Human?theguardian.com, August 3, 2016.

[ 85 ] Jesse Emspak:“机器如何学习偏见”,scientificamerican.com,2016 年 12 月 29 日。

[85] Jesse Emspak: “How a Machine Learns Prejudice,” scientificamerican.com, December 29, 2016.

[ 86 ] Maciej Cegłowski:“技术的道德经济”, idlewords.com,2016 年 6 月。

[86] Maciej Cegłowski: “The Moral Economy of Tech,” idlewords.com, June 2016.

[ 87 ] Cathy O'Neil: 数学毁灭性武器:大数据如何增加不平等并威胁民主。皇冠出版,2016。ISBN:978-0-553-41881-1

[87] Cathy O’Neil: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing, 2016. ISBN: 978-0-553-41881-1

[ 88 ] Julia Angwin:“让算法负责”,nytimes.com,2016 年 8 月 1 日。

[88] Julia Angwin: “Make Algorithms Accountable,” nytimes.com, August 1, 2016.

[ 89 ] Bryce Goodman 和 Seth Flaxman:“欧盟关于算法决策和‘解释权’的规定”,arXiv:1606.08813,2016年 8 月 31 日。

[89] Bryce Goodman and Seth Flaxman: “European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’,” arXiv:1606.08813, August 31, 2016.

[ 90 ]“数据经纪行业回顾:出于营销目的收集、使用和销售消费者数据”,工作人员报告,美国参议院商业、科学和运输委员会commerce.senate.gov,2013 年 12 月。

[90] “A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes,” Staff Report, United States Senate Committee on Commerce, Science, and Transportation, commerce.senate.gov, December 2013.

[ 91 ] Olivia Solon:“ Facebook 的失败:假新闻和两极分化的政治是否让特朗普当选?”,theguardian.com,2016 年 11 月 10 日。

[91] Olivia Solon: “Facebook’s Failure: Did Fake News and Polarized Politics Get Trump Elected?theguardian.com, November 10, 2016.

[ 92 ] Donella H. Meadows 和 Diana Wright: 系统思考:入门。切尔西格林出版社,2008 年。ISBN:978-1-603-58055-7

[92] Donella H. Meadows and Diana Wright: Thinking in Systems: A Primer. Chelsea Green Publishing, 2008. ISBN: 978-1-603-58055-7

[ 93 ] Daniel J. Bernstein:“聆听‘大数据’/‘数据科学’演讲”,twitter.com,2015 年 5 月 12 日。

[93] Daniel J. Bernstein: “Listening to a ‘big data’/‘data science’ talk,” twitter.com, May 12, 2015.

[ 94 ]Marc Andreessen:“为什么软件正在吞噬世界”,《华尔街日报》,2011 年 8 月 20 日。

[94] Marc Andreessen: “Why Software Is Eating the World,” The Wall Street Journal, 20 August 2011.

[ 95 ] JM Porup:“ ‘物联网’安全性被严重破坏并且变得更糟,” arstechnica.com,2016 年 1 月 23 日。

[95] J. M. Porup: “‘Internet of Things’ Security Is Hilariously Broken and Getting Worse,” arstechnica.com, January 23, 2016.

[ 96 ] 布鲁斯·施奈尔:数据与歌利亚:收集数据和控制世界的隐藏战斗。W. W. Norton,2015 年。ISBN:978-0-393-35217-7

[96] Bruce Schneier: Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World. W. W. Norton, 2015. ISBN: 978-0-393-35217-7

[ 97 ] Grugq:“没什么可隐藏的”, grugq.tumblr.com,2016 年 4 月 15 日。

[97] The Grugq: “Nothing to Hide,” grugq.tumblr.com, April 15, 2016.

[ 98 ] Tony Beltramelli:“深度间谍:使用智能手表和深度学习进行间谍活动”,哥本哈根 IT 大学硕士论文,2015 年 12 月。可在 arxiv.org/abs/1512.05616获取

[98] Tony Beltramelli: “Deep-Spying: Spying Using Smartwatch and Deep Learning,” Masters Thesis, IT University of Copenhagen, December 2015. Available at arxiv.org/abs/1512.05616

[ 99 ] Shoshana Zuboff:“大他者:监视资本主义和信息文明的前景”,《信息技术杂志》,第 30 卷,第 1 期,第 75-89 页,2015 年 4 月 。doi:10.1057/jit.2015.5

[99] Shoshana Zuboff: “Big Other: Surveillance Capitalism and the Prospects of an Information Civilization,” Journal of Information Technology, volume 30, number 1, pages 75–89, April 2015. doi:10.1057/jit.2015.5

[ 100 ] Carina C. Zona:“ Consequences of an Insightful Algorithm ”,GOTO Berlin,2016 年 11 月。

[100] Carina C. Zona: “Consequences of an Insightful Algorithm,” at GOTO Berlin, November 2016.

[ 101 ] Bruce Schneier:“数据是有毒资产,为什么不扔掉它呢?”,schneier.com,2016 年 3 月 1 日。

[101] Bruce Schneier: “Data Is a Toxic Asset, So Why Not Throw It Out?,” schneier.com, March 1, 2016.

[ 102 ] John E. Dunn:“英国 15 起最臭名昭著的数据泄露事件”,techworld.com,2016 年 11 月 18 日。

[102] John E. Dunn: “The UK’s 15 Most Infamous Data Breaches,” techworld.com, November 18, 2016.

[ 103 ] Cory Scott:“数据并非有毒(有毒意味着毫无益处),而是一种危险物质,我们必须在需要与想要之间取得平衡”,twitter.com,2016 年 3 月 6 日。

[103] Cory Scott: “Data is not toxic - which implies no benefit - but rather hazardous material, where we must balance need vs. want,” twitter.com, March 6, 2016.

[ 104 ] 布鲁斯·施奈尔:“任务蔓延:当一切都是恐怖主义时”,schneier.com,2013 年 7 月 16 日。

[104] Bruce Schneier: “Mission Creep: When Everything Is Terrorism,” schneier.com, July 16, 2013.

[ 105 ] Lena Ulbricht 和 Maximilian von Grafenstein:“大数据:权力大转移?”,《互联网政策评论》,第 5 卷,第 1 期,2016 年 3 月 。doi:10.14763/2016.1.406

[105] Lena Ulbricht and Maximilian von Grafenstein: “Big Data: Big Power Shifts?,” Internet Policy Review, volume 5, number 1, March 2016. doi:10.14763/2016.1.406

[ 106 ] Ellen P. Goodman 和 Julia Powles:“ Facebook 和 Google:我们所知道的最强大和最神秘的帝国”,theguardian.com,2016 年 9 月 28 日。

[106] Ellen P. Goodman and Julia Powles: “Facebook and Google: Most Powerful and Secretive Empires We’ve Ever Known,” theguardian.com, September 28, 2016.

[ 107 ] 关于在个人数据处理和此类数据自由流动方面保护个人的指令 95/46/EC,欧洲共同体官方公报,第 L 281/31 号,eur-lex.europa.eu,1995 年 11 月。

[107] Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data, Official Journal of the European Communities No. L 281/31, eur-lex.europa.eu, November 1995.

[ 108 ] Brendan Van Alsenoy:“监管数据保护:参与个人数据处理的参与者之间的责任和风险分配”,论文,鲁汶大学 IT 和知识产权法中心,2016 年 8 月。

[108] Brendan Van Alsenoy: “Regulating Data Protection: The Allocation of Responsibility and Risk Among Actors Involved in Personal Data Processing,” Thesis, KU Leuven Centre for IT and IP Law, August 2016.

[ 109 ] Michiel Rhoen:“超越同意:通过消费者保护法改善数据保护”,《互联网政策评论》,第 5 卷,第 1 期,2016 年 3 月 。doi:10.14763/2016.1.404

[109] Michiel Rhoen: “Beyond Consent: Improving Data Protection Through Consumer Protection Law,” Internet Policy Review, volume 5, number 1, March 2016. doi:10.14763/2016.1.404

[ 110 ] Jessica Leber:“您的数据足迹正在以您无法想象的方式影响您的生活”,fastcoexist.com,2016 年 3 月 15 日。

[110] Jessica Leber: “Your Data Footprint Is Affecting Your Life in Ways You Can’t Even Imagine,” fastcoexist.com, March 15, 2016.

[ 111 ] Maciej Cegłowski:“被数据困扰”,idlewords.com,2015 年 10 月。

[111] Maciej Cegłowski: “Haunted by Data,” idlewords.com, October 2015.

[ 112 ] Sam Thielman:“你不是你读到的那样:图书馆员清除用户数据以保护隐私”,theguardian.com,2016 年 1 月 13 日。

[112] Sam Thielman: “You Are Not What You Read: Librarians Purge User Data to Protect Privacy,” theguardian.com, January 13, 2016.

[ 113 ] Conor Friedersdorf:“爱德华·斯诺登泄密的其他动机”,theatlantic.com,2014 年 5 月 13 日。

[113] Conor Friedersdorf: “Edward Snowden’s Other Motive for Leaking,” theatlantic.com, May 13, 2014.

[ 114 ] Phillip Rogaway:“密码学工作的道德品质”,Cryptology ePrint 2015/1162,2015 年 12 月。

[114] Phillip Rogaway: “The Moral Character of Cryptographic Work,” Cryptology ePrint 2015/1162, December 2015.

词汇表

Glossary

注意

请注意,本术语表中的定义简短明了,旨在传达核心思想,而不是术语的全部微妙之处。欲了解更多详细信息,请参阅正文中的参考文献。

Please note that the definitions in this glossary are short and simple, intended to convey the core idea but not the full subtleties of a term. For more detail, please follow the references into the main text.

异步
asynchronous

不等待某件事完成(例如,通过网络将数据发送到另一个节点),也不对需要多长时间做出任何假设。请参阅“同步与异步复制”“同步与异步网络”“系统模型与现实”

Not waiting for something to complete (e.g., sending data over the network to another node), and not making any assumptions about how long it is going to take. See “Synchronous Versus Asynchronous Replication”, “Synchronous Versus Asynchronous Networks”, and “System Model and Reality”.

原子
atomic
  1. 在并发操作的上下文中:描述一个似乎在单个时间点生效的操作,因此另一个并发进程永远不会遇到处于“半完成”状态的操作。另请参阅隔离

  2. 在事务的上下文中:将一组写入分组在一起,即使发生故障,这些写入也必须全部提交或全部回滚。请参阅“原子性”“原子提交和两阶段提交(2PC)”

  1. In the context of concurrent operations: describing an operation that appears to take effect at a single point in time, so another concurrent process can never encounter the operation in a “half-finished” state. See also isolation.

  2. In the context of transactions: grouping together a set of writes that must either all be committed or all be rolled back, even if faults occur. See “Atomicity” and “Atomic Commit and Two-Phase Commit (2PC)”.

背压
backpressure

迫使某些数据的发送方放慢速度,因为接收方无法跟上。也称为流量控制。请参阅“消息系统”

Forcing the sender of some data to slow down because the recipient cannot keep up with it. Also known as flow control. See “Messaging Systems”.

批处理
batch process

一种计算,它采用一些固定(通常是很大)的数据集作为输入,并生成一些其他数据作为输出,而不修改输入。参见第 10 章

A computation that takes some fixed (and usually large) set of data as input and produces some other data as output, without modifying the input. See Chapter 10.

有界的
bounded

具有一些已知的上限或大小。例如,用于网络延迟(请参阅 “超时和无限延迟” )和数据集(请参阅第 11 章的介绍)的上下文中。

Having some known upper limit or size. Used for example in the context of network delay (see “Timeouts and Unbounded Delays”) and datasets (see the introduction to Chapter 11).

拜占庭故障
Byzantine fault

以某种任意方式表现不正确的节点,例如向其他节点发送矛盾或恶意消息。请参阅“拜占庭错误”

A node that behaves incorrectly in some arbitrary way, for example by sending contradictory or malicious messages to other nodes. See “Byzantine Faults”.

缓存
cache

记住最近使用的数据以加快将来读取相同数据的组件。它通常是不完整的:因此,如果缓存中缺少某些数据,则必须从具有完整数据副本的某些底层、速度较慢的数据存储系统中获取数据。

A component that remembers recently used data in order to speed up future reads of the same data. It is generally not complete: thus, if some data is missing from the cache, it has to be fetched from some underlying, slower data storage system that has a complete copy of the data.

CAP定理
CAP theorem

一个被广泛误解的理论结果,在实践中没有用处。参见 “CAP 定理”

A widely misunderstood theoretical result that is not useful in practice. See “The CAP theorem”.

因果关系
causality

当系统中一件事“先于”另一件事发生时,事件之间就会产生依赖关系。例如,响应于较早事件的较晚事件,或者建立在较早事件之上,或者应该根据较早事件来理解。请参阅 ““发生在”之前的关系和并发”“顺序和因果关系”

The dependency between events that arises when one thing “happens before” another thing in a system. For example, a later event that is in response to an earlier event, or builds upon an earlier event, or should be understood in the light of an earlier event. See “The “happens-before” relationship and concurrency” and “Ordering and Causality”.

共识
consensus

分布式计算中的一个基本问题,涉及让多个节点就某些事情达成一致(例如,哪个节点应该是数据库集群的领导者)。这个问题比乍一看要困难得多。参见“容错共识”

A fundamental problem in distributed computing, concerning getting several nodes to agree on something (for example, which node should be the leader for a database cluster). The problem is much harder than it seems at first glance. See “Fault-Tolerant Consensus”.

数据仓库
data warehouse

一个数据库,其中来自多个不同 OLTP 系统的数据已组合并准备用于分析目的。请参阅“数据仓库”

A database in which data from several different OLTP systems has been combined and prepared to be used for analytics purposes. See “Data Warehousing”.

声明式
declarative

描述某事物应具有的属性,但不描述如何实现它的确切步骤。在查询上下文中,查询优化器采用声明性查询并决定如何最好地执行它。请参阅“数据查询语言”

Describing the properties that something should have, but not the exact steps for how to achieve it. In the context of queries, a query optimizer takes a declarative query and decides how it should best be executed. See “Query Languages for Data”.

非规范化
denormalize

在规范化数据集中引入一定量的冗余或重复(通常以缓存或索引的形式),以加快读取速度。非规范化值是一种预先计算的查询结果,类似于物化视图。请参阅“单对象和多对象操作”和“从同一事件日志导出多个视图”。

To introduce some amount of redundancy or duplication in a normalized dataset, typically in the form of a cache or index, in order to speed up reads. A denormalized value is a kind of precomputed query result, similar to a materialized view. See “Single-Object and Multi-Object Operations” and “Deriving several views from the same event log”.

派生数据
derived data

通过可重复过程从其他数据创建的数据集,如有必要,您可以再次运行该过程。通常,需要派生数据来加速对数据的特定类型的读取访问。索引、缓存和物化视图都是派生数据的示例。参见第三部分的介绍。

A dataset that is created from some other data through a repeatable process, which you could run again if necessary. Usually, derived data is needed to speed up a particular kind of read access to the data. Indexes, caches, and materialized views are examples of derived data. See the introduction to Part III.

确定性的
deterministic

描述一个函数,如果给它相同的输入,它总是产生相同的输出。这意味着它不能依赖于随机数、一天中的时间、网络通信或其他不可预测的事物。

Describing a function that always produces the same output if you give it the same input. This means it cannot depend on random numbers, the time of day, network communication, or other unpredictable things.

分布式
distributed

在通过网络连接的多个节点上运行。部分故障的特点是:系统的某些部分可能损坏,而其他部分仍在工作,并且软件通常无法知道到底损坏了什么。请参阅“故障和部分故障”

Running on several nodes connected by a network. Characterized by partial failures: some part of the system may be broken while other parts are still working, and it is often impossible for the software to know what exactly is broken. See “Faults and Partial Failures”.

持久的
durable

以您相信即使发生各种故障也不会丢失的方式存储数据。参见“持久性”。

Storing data in a way such that you believe it will not be lost, even if various faults occur. See “Durability”.

ETL
ETL

提取-转换-加载。从源数据库中提取数据,将其转换为更适合分析查询的形式,并将其加载到数据仓库或批处理系统中的过程。请参阅“数据仓库”

Extract–Transform–Load. The process of extracting data from a source database, transforming it into a form that is more suitable for analytic queries, and loading it into a data warehouse or batch processing system. See “Data Warehousing”.

故障转移
failover

在具有单个领导者的系统中,故障转移是将领导角色从一个节点转移到另一个节点的过程。请参阅“处理节点中断”

In systems that have a single leader, failover is the process of moving the leadership role from one node to another. See “Handling Node Outages”.

容错
fault-tolerant

如果出现问题(例如,如果机器崩溃或网络链接失败),能够自动恢复。参见“可靠性”

Able to recover automatically if something goes wrong (e.g., if a machine crashes or a network link fails). See “Reliability”.

流量控制
flow control

参见背压

See backpressure.

追随者
follower

副本不直接接受来自客户端的任何写入,而仅处理从领导者接收到的数据更改。也称为辅助从属只读副本热备用。参见“领导者和追随者”

A replica that does not directly accept any writes from clients, but only processes data changes that it receives from a leader. Also known as a secondary, slave, read replica, or hot standby. See “Leaders and Followers”.

全文检索
full-text search

通过任意关键字搜索文本,通常具有附加功能,例如匹配拼写相似的单词或同义词。全文索引是一种支持此类查询的二级索引。请参阅“全文搜索和模糊索引”

Searching text by arbitrary keywords, often with additional features such as matching similarly spelled words or synonyms. A full-text index is a kind of secondary index that supports such queries. See “Full-text search and fuzzy indexes”.

图
graph

一种数据结构,由顶点(可以引用的事物,也称为节点或实体)和边(从一个顶点到另一个顶点的连接,也称为关系或弧)组成。请参阅“类图数据模型”。

A data structure consisting of vertices (things that you can refer to, also known as nodes or entities) and edges (connections from one vertex to another, also known as relationships or arcs). See “Graph-Like Data Models”.

散列
hash

将输入转换为看似随机的数字的函数。相同的输入始终返回相同的数字作为输出。两个不同的输入很可能产生两个不同的数字作为输出,尽管两个不同的输入也有可能产生相同的输出(这称为冲突)。请参阅“按密钥哈希进行分区”。

A function that turns an input into a random-looking number. The same input always returns the same number as output. Two different inputs are very likely to have two different numbers as output, although it is possible that two different inputs produce the same output (this is called a collision). See “Partitioning by Hash of Key”.
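As a small illustration of this idea (my own sketch, not from the book), a stable hash can be used to assign keys to partitions; the name `partition_for` and the partition count are hypothetical:

```python
import hashlib

def partition_for(key: str, partitions: int = 8) -> int:
    # A stable hash of the key (MD5 here), reduced to a partition number.
    # Python's built-in hash() is randomized per process for strings,
    # so a fixed algorithm is used instead.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partitions

# The same input always maps to the same partition:
print(partition_for("user:42") == partition_for("user:42"))  # True
```

With only 8 partitions, collisions (two different keys landing on the same partition) are of course common; that is exactly what a hash-partitioned system relies on to spread keys evenly.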

幂等的
idempotent

描述可以安全重试的操作;如果执行多次,则与只执行一次效果相同。参见“幂等性”

Describing an operation that can be safely retried; if it is executed more than once, it has the same effect as if it was only executed once. See “Idempotence”.
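To make the contrast concrete, here is a minimal sketch (hypothetical operations, not from the book): overwriting a value is idempotent, while incrementing is not:

```python
state = {"balance": 0}

def set_balance(value):
    # Idempotent: applying it twice leaves the same state as applying it once.
    state["balance"] = value

def increment_balance(amount):
    # Not idempotent: retrying after a lost acknowledgment double-counts.
    state["balance"] += amount

set_balance(100)
set_balance(100)       # safe to retry: balance is still 100
increment_balance(10)
increment_balance(10)  # a duplicated retry: balance becomes 120, not 110
print(state["balance"])  # 120
```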

索引
index

一种数据结构,可让您有效地搜索在特定字段中具有特定值的所有记录。请参阅“为数据库提供支持的数据结构”

A data structure that lets you efficiently search for all records that have a particular value in a particular field. See “Data Structures That Power Your Database”.

隔离
isolation

在事务的上下文中,描述并发执行的事务可以相互干扰的程度。可串行化隔离提供了最强的保证,但较弱的隔离级别也有使用。参见“隔离”。

In the context of transactions, describing the degree to which concurrently executing transactions can interfere with each other. Serializable isolation provides the strongest guarantees, but weaker isolation levels are also used. See “Isolation”.

连接
join

将具有共同点的记录放在一起。最常用于以下情况:一条记录​​引用另一条记录(外键、文档引用、图中的边)并且查询需要获取引用指向的记录。请参阅“多对一和多对多关系”“减少端连接和分组”

To bring together records that have something in common. Most commonly used in the case where one record has a reference to another (a foreign key, a document reference, an edge in a graph) and a query needs to get the record that the reference points to. See “Many-to-One and Many-to-Many Relationships” and “Reduce-Side Joins and Grouping”.

领导者
leader

当数据或服务跨多个节点复制时,领导者是允许进行更改的指定副本。领导者可以通过某种协议选举出来,或者由管理员手动选择。也称为初级大师。参见“领导者和追随者”

When data or a service is replicated across several nodes, the leader is the designated replica that is allowed to make changes. A leader may be elected through some protocol, or manually chosen by an administrator. Also known as the primary or master. See “Leaders and Followers”.

线性化
linearizable

表现得好像系统中只有一个数据副本,该副本通过原子操作进行更新。请参阅“线性化”

Behaving as if there was only a single copy of data in the system, which is updated by atomic operations. See “Linearizability”.

局部性
locality

性能优化:如果同时经常需要多条数据,则将它们放在同一个地方。请参阅“查询的数据局部性”

A performance optimization: putting several pieces of data in the same place if they are frequently needed at the same time. See “Data locality for queries”.

锁
lock

一种确保只有一个线程、节点或事务可以访问某事物的机制,任何其他想要访问同一事物的线程都必须等待直到锁被释放。请参阅“两阶段锁定 (2PL)”和“领导者和锁”。

A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See “Two-Phase Locking (2PL)” and “The leader and the lock”.
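A minimal sketch of this behavior, using Python's standard `threading.Lock` (the counter example is my own, not from the book):

```python
import threading

lock = threading.Lock()
counter = 0

def increment():
    global counter
    # Only one thread at a time may hold the lock; any other thread
    # reaching this point waits until it is released.
    with lock:
        counter += 1

threads = [threading.Thread(target=increment) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 100: no increments were lost
```

Without the lock, the read-modify-write in `counter += 1` could interleave between threads and silently lose updates.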

日志
log

用于存储数据的仅附加文件。预写日志用于使存储引擎能够抵御崩溃(请参阅“使 B 树可靠”),日志结构存储引擎使用日志作为其主要存储格式(请参阅“SSTables 和 LSM-Trees”),复制日志用于将写入从领导者复制到追随者(请参阅“领导者和追随者”),事件日志可以表示数据流(请参阅“分区日志”)。

An append-only file for storing data. A write-ahead log is used to make a storage engine resilient against crashes (see “Making B-trees reliable”), a log-structured storage engine uses logs as its primary storage format (see “SSTables and LSM-Trees”), a replication log is used to copy writes from a leader to followers (see “Leaders and Followers”), and an event log can represent a data stream (see “Partitioned Logs”).

物化
materialize

急切地执行计算并写出结果,而不是根据请求按需计算。请参阅“聚合:数据立方体和物化视图”“中间状态的物化”

To perform a computation eagerly and write out its result, as opposed to calculating it on demand when requested. See “Aggregation: Data Cubes and Materialized Views” and “Materialization of Intermediate State”.

节点
node

计算机上运行的某些软件的实例,它通过网络与其他节点通信以完成某些任务。

An instance of some software running on a computer, which communicates with other nodes via a network in order to accomplish some task.

规范化
normalized

以不存在冗余或重复的方式构建。在规范化数据库中,当某条数据发生更改时,只需在一个地方更改它,而不需要在许多不同的地方复制很多份。请参阅“多对一和多对多关系”

Structured in such a way that there is no redundancy or duplication. In a normalized database, when some piece of data changes, you only need to change it in one place, not many copies in many different places. See “Many-to-One and Many-to-Many Relationships”.

联机分析处理
OLAP

在线分析处理。访问模式的特征是对大量记录进行聚合(例如,计数、总和、平均值)。请参阅“事务处理还是分析?”

Online analytic processing. Access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See “Transaction Processing or Analytics?”.

联机事务处理
OLTP

在线事务处理。访问模式的特点是快速查询,读取或写入少量记录,通常按键索引。请参阅“事务处理还是分析?”

Online transaction processing. Access pattern characterized by fast queries that read or write a small number of records, usually indexed by key. See “Transaction Processing or Analytics?”.

划分
partitioning

将对于单台机器来说太大的大型数据集或计算拆分为更小的部分,并将它们分布在多台机器上。也称为分片。参见 第 6 章

Splitting up a large dataset or computation that is too big for a single machine into smaller parts and spreading them across several machines. Also known as sharding. See Chapter 6.

百分位数
percentile

一种通过计算有多少值高于或低于某个阈值来衡量值分布的方法。例如,某个时间段内的第 95 个百分位响应时间是一个时间 t,使得该时间段内 95% 的请求在 t 之内完成,其余 5% 的请求耗时超过 t。请参阅“描述性能”。

A way of measuring the distribution of values by counting how many values are above or below some threshold. For example, the 95th percentile response time during some period is the time t such that 95% of requests in that period complete in less than t, and 5% take longer than t. See “Describing Performance”.
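For illustration, here is a sketch of the nearest-rank method for computing a percentile (one of several common definitions; the function name and data are my own):

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of the observations are less than or equal to it."""
    ordered = sorted(values)
    # ceil(p * n / 100), converted to a 0-based index, clamped at 0
    rank = max(0, -(-p * len(ordered) // 100) - 1)
    return ordered[rank]

# Ten response times in milliseconds: the 95th percentile here is 320 ms,
# i.e., 95% of requests completed in no more than that time.
times = [12, 14, 15, 15, 17, 18, 20, 25, 40, 320]
print(percentile(times, 95))  # 320
```

Note how a single slow outlier dominates the high percentiles even though the median (17 ms here) looks healthy, which is why tail latencies are reported separately.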

主键
primary key

唯一标识记录的值(通常是数字或字符串)。在许多应用中,主键是在创建记录时由系统生成的(例如,顺序或随机);它们通常不是由用户设置的。另请参见二级索引

A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (e.g., sequentially or randomly); they are not usually set by users. See also secondary index.

法定人数
quorum

在操作被视为成功之前需要对操作进行投票的最小节点数。请参阅“阅读和写作的法定人数”

The minimum number of nodes that need to vote on an operation before it can be considered successful. See “Quorums for reading and writing”.

重新平衡
rebalance

将数据或服务从一个节点移动到另一个节点以公平地分散负载。请参阅 “重新平衡分区”

To move data or services from one node to another in order to spread the load fairly. See “Rebalancing Partitions”.

复制
replication

在多个节点上保留相同数据的副本(副本),以便在某个节点无法访问时仍然可以访问该数据。参见第 5 章

Keeping a copy of the same data on several nodes (replicas) so that it remains accessible if a node becomes unreachable. See Chapter 5.

模式
schema

对某些数据结构的描述,包括其字段和数据类型。某些数据是否符合模式可以在数据生命周期的不同点进行检查(请参阅 “文档模型中的模式灵活性”),并且模式可以随着时间的推移而改变(请参阅第 4 章)。

A description of the structure of some data, including its fields and datatypes. Whether some data conforms to a schema can be checked at various points in the data’s lifetime (see “Schema flexibility in the document model”), and a schema can change over time (see Chapter 4).

二级索引
secondary index

与主数据存储一起维护的附加数据结构,允许您有效地搜索与某种条件匹配的记录。请参阅 “其他索引结构”“分区和二级索引”

An additional data structure that is maintained alongside the primary data storage and which allows you to efficiently search for records that match a certain kind of condition. See “Other Indexing Structures” and “Partitioning and Secondary Indexes”.

可序列化
serializable

保证如果多个事务同时执行,它们的行为就像按某种串行顺序一次执行一个事务一样。请参阅“可串行化”

A guarantee that if several transactions execute concurrently, they behave the same as if they had executed one at a time, in some serial order. See “Serializability”.

无共享
shared-nothing

与共享内存或共享磁盘架构相反,独立节点(每个节点都有自己的 CPU、内存和磁盘)通过传统网络连接的架构。参见第二部分的介绍。

An architecture in which independent nodes—each with their own CPUs, memory, and disks—are connected via a conventional network, in contrast to shared-memory or shared-disk architectures. See the introduction to Part II.

倾斜
skew
  1. 分区之间的负载不平衡,例如某些分区有大量请求或数据,而其他分区则少得多。也称为热点。请参阅“倾斜工作负载和缓解热点”“处理倾斜”

  2. 导致事件以意外、无序的顺序出现的时间异常。请参阅“快照隔离和可重复读取”中的读取偏差讨论,“写入偏差和幻像”中的 写入偏差讨论,以及“排序事件的时间戳”中的时钟偏差讨论

  1. Imbalanced load across partitions, such that some partitions have lots of requests or data, and others have much less. Also known as hot spots. See “Skewed Workloads and Relieving Hot Spots” and “Handling skew”.

  2. A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of read skew in “Snapshot Isolation and Repeatable Read”, write skew in “Write Skew and Phantoms”, and clock skew in “Timestamps for ordering events”.

脑裂
split brain

两个节点同时认为自己是领导者的场景,这可能会导致系统保证被违反。请参阅“处理节点中断”“事实是由多数人决定的”

A scenario in which two nodes simultaneously believe themselves to be the leader, and which may cause system guarantees to be violated. See “Handling Node Outages” and “The Truth Is Defined by the Majority”.

存储过程
stored procedure

一种对事务逻辑进行编码的方法,使其可以完全在数据库服务器上执行,而无需在事务期间与客户端来回通信。请参阅 “实际串行执行”

A way of encoding the logic of a transaction such that it can be entirely executed on a database server, without communicating back and forth with a client during the transaction. See “Actual Serial Execution”.

流式处理
stream process

持续运行的计算,消耗永无止境的事件流作为输入,并从中导出一些输出。参见第 11 章

A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See Chapter 11.

同步
synchronous

与异步相反。

The opposite of asynchronous.

记录系统
system of record

保存某些数据的主要、权威版本(也称为事实来源)的系统。更改首先写入此处,其他数据集可能来自记录系统。参见第三部分的介绍。

A system that holds the primary, authoritative version of some data, also known as the source of truth. Changes are first written here, and other datasets may be derived from the system of record. See the introduction to Part III.

超时
timeout

检测故障最简单的方法之一,即观察一段时间内是否没有响应。但是,无法知道超时是由于远程节点问题还是网络问题造成的。请参阅“超时和无限延迟”

One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See “Timeouts and Unbounded Delays”.

全序
total order

一种比较事物(例如时间戳)的方法,使您始终可以说出两件事物中哪一个更大、哪一个更小。如果某些事物不可比较(无法说出哪个更大或更小),这种排序称为偏序。请参阅“因果顺序不是全序”。

A way of comparing things (e.g., timestamps) that allows you to always say which one of two things is greater and which one is lesser. An ordering in which some things are incomparable (you cannot say which is greater or smaller) is called a partial order. See “The causal order is not a total order”.
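A small sketch of the distinction (my own example): integers are totally ordered, while set inclusion is only a partial order:

```python
# Integers are totally ordered: for any pair, exactly one of <, >, == holds.
assert (3 < 5) or (5 < 3) or (3 == 5)

# Set inclusion is only a partial order: {1, 2} and {2, 3} are
# incomparable -- neither is a subset of the other, nor are they equal.
a, b = {1, 2}, {2, 3}
print(a <= b, b <= a, a == b)  # False False False
```

This mirrors causal order in a distributed system: two concurrent events, like the two sets above, simply cannot be ranked against each other.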

事务
transaction

将多个读取和写入组合成一个逻辑单元,以简化错误处理和并发问题。参见第 7 章

Grouping together several reads and writes into a logical unit, in order to simplify error handling and concurrency issues. See Chapter 7.

两阶段提交 (2PC)
two-phase commit (2PC)

一种确保多个数据库节点全部提交或全部中止事务的算法。请参阅“原子提交和两阶段提交 (2PC)”

An algorithm to ensure that several database nodes either all commit or all abort a transaction. See “Atomic Commit and Two-Phase Commit (2PC)”.

两阶段锁定 (2PL)
two-phase locking (2PL)

一种用于实现可序列化隔离的算法,其工作原理是事务获取其读取或写入的所有数据的锁,并保持该锁直到事务结束。请参阅“两阶段锁定 (2PL)”。

An algorithm for achieving serializable isolation that works by a transaction acquiring a lock on all data it reads or writes, and holding the lock until the end of the transaction. See “Two-Phase Locking (2PL)”.

无界的
unbounded

没有任何已知的上限或大小。与有界相反。

Not having any known upper limit or size. The opposite of bounded.

索引

Index

关于作者

About the Author

Martin Kleppmann是英国剑桥大学分布式系统研究员。此前,他是 LinkedIn 和 Rapportive 等互联网公司的软件工程师和企业家,负责大规模数据基础设施的工作。在这个过程中,他通过艰难的方式学到了一些东西,他希望这本书能让你避免重蹈覆辙。

Martin Kleppmann is a researcher in distributed systems at the University of Cambridge, UK. Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes.

Martin 经常在会议上演讲,也是一名博主和开源贡献者。他认为深刻的技术思想应该让每个人都能接触到,而更深入的理解将帮助我们开发出更好的软件。

Martin is a regular conference speaker, blogger, and open source contributor. He believes that profound technical ideas should be accessible to everyone, and that deeper understanding will help us develop better software.

后记

Colophon

《设计数据密集型应用程序》封面上的动物是印度野猪 ( Sus scrofa cristatus ),它是在印度、缅甸、尼泊尔、斯里兰卡和泰国发现的野猪亚种。它们与欧洲野猪的不同之处在于,它们的背部有较高的刚毛,没有毛茸茸的底毛,头骨更大、更直。

The animal on the cover of Designing Data-Intensive Applications is an Indian wild boar (Sus scrofa cristatus), a subspecies of wild boar found in India, Myanmar, Nepal, Sri Lanka, and Thailand. They are distinctive from European boars in that they have higher back bristles, no woolly undercoat, and a larger, straighter skull.

印度野猪有一层灰色或黑色的毛发,脊柱上有坚硬的鬃毛。雄性有突出的犬齿(称为獠牙),用于与对手搏斗或抵御掠食者。雄性比雌性大,但该物种平均肩高 33-35 英寸,体重 200-300 磅。它们的天敌包括熊、老虎和各种大型猫科动物。

The Indian wild boar has a coat of gray or black hair, with stiff bristles running along the spine. Males have protruding canine teeth (called tushes) that are used to fight with rivals or fend off predators. Males are larger than females, but the species averages 33–35 inches tall at the shoulder and 200–300 pounds in weight. Their natural predators include bears, tigers, and various big cats.

这些动物是夜行性的杂食动物,吃各种各样的东西,包括根、昆虫、腐肉、坚果、浆果和小动物。众所周知,野猪还会在垃圾和农田中翻找,造成大量破坏,招致农民的敌意。它们每天需要摄入 4,000-4,500 卡路里的热量。野猪的嗅觉发达,这有助于它们寻找地下的植物材料和穴居动物。然而,它们的视力很差。

These animals are nocturnal and omnivorous—they eat a wide variety of things, including roots, insects, carrion, nuts, berries, and small animals. Wild boars are also known to root through garbage and crop fields, causing a great deal of destruction and earning the enmity of farmers. They need to eat 4,000–4,500 calories a day. Boars have a well-developed sense of smell, which helps them forage for underground plant material and burrowing animals. However, their eyesight is poor.

野猪在人类文化中长期以来一直具有重要意义。在印度教传说中,野猪是毗湿奴神的化身。在古希腊的丧葬纪念碑中,它是勇敢的失败者的象征(与胜利的狮子形成鲜明对比)。由于其侵略性,它被描绘在斯堪的纳维亚、日耳曼和盎格鲁-撒克逊战士的盔甲和武器上。在十二生肖中,它象征着决心和冲动。

Wild boars have long held significance in human culture. In Hindu lore, the boar is an avatar of the god Vishnu. In ancient Greek funerary monuments, it was a symbol of a gallant loser (in contrast to the victorious lion). Due to its aggression, it was depicted on the armor and weapons of Scandinavian, Germanic, and Anglo-Saxon warriors. In the Chinese zodiac, it symbolizes determination and impetuosity.

奥莱利封面上的许多动物都濒临灭绝。所有这些对世界都很重要。要了解有关如何提供帮助的更多信息,请访问animals.oreilly.com

Many of the animals on O’Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

封面图片来自 Shaw's Zoology。封面字体为 URW Typewriter 和 Guardian Sans。正文字体为 Adobe Minion Pro;图表字体为 Adobe Myriad Pro;标题字体为 Adobe Myriad Condensed;代码字体为 Dalton Maag 的 Ubuntu Mono。

The cover image is from Shaw’s Zoology. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the font in diagrams is Adobe Myriad Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.